ABSTRACT

We  study  the  steady  states  of  a  system  in  which  players  learn  about  the 
strategies  their  opponents  are  playing  by  updating  their  Bayesian  priors  in 
light  of  their  observations.   Players  are  matched  at  random  to  play  a  fixed 
extensive-form game, and each player observes the realized actions in his own
matches, but not the intended off-path play of his opponents or the realized
actions in other matches. If lifetimes are long and players are very
patient, the steady state distribution of actions approximates that of a
Nash equilibrium.


Keywords:   Learning,  Nash  equilibrium,  multi-armed  bandits 
1.  Introduction

We study an extensive-form game played repeatedly by a large population
of  players  who  are  matched  with  one  another  at  random,  as  in  Rosenthal 
[1979].   As  in  Fudenberg  and  Kreps  [1991],  we  suppose  that  players  do  not 
know  the  strategies  their  opponents  are  using,  but  learn  about  them  from 
their  past  observations  of  the  opponents'  play.   At  the  end  of  each  period, 
each  player  observes  the  realized  actions  of  his  own  opponents;  players  do 
not  observe  play  in  the  other  matches.   The  key  feature  of  such  learning  is 
that  the  stage  game  may  have  information  sets  that  are  not  reached  in  the 
course  of  play,  and  observing  the  opponents'  realized  actions  does  not 
reveal  how  the  opponents  would  have  played  at  the  unreached  information 
sets.   Thus,  even  if  the  players  play  the  stage  game  many  times,  they  may 
continue  to  hold  incorrect  beliefs  about  the  opponents'  play  unless  they 
engage  in  a  sufficient  amount  of  "experimentation". 

In  our  model,  the  total  size  of  the  population  is  constant,  while 
individuals  enter  and  leave  after  a  finite  number  of  periods.   Entering 
players  believe  that  they  face  a  fixed  but  unknown  distribution  of 
opponents'  strategies.   They  have  non-doctrinaire  prior  beliefs  about  this 
distribution  which  they  update  using  Bayes  rule,  and  their  actions  in  each 
period  are  chosen  to  maximize  the  expected  present  value  of  their  payoffs 
given  their  beliefs.   Steady  states  always  exist.   When  lifetimes  are  short, 
steady  state  play  is  mostly  determined  by  the  priors,  as  players  have  little 
chance  to  learn  from  experience.   Steady  states  need  not  be  Nash  equilibria 
if  lifetimes  are  long  but  players  are  impatient,  because  players  may  not 
learn their opponents' off-path play. Instead, steady states with impatient
players correspond to the "self-confirming equilibria" introduced in our
[1990] paper. Our main result is that if lifetimes are long and players
are sufficiently patient, then expected-value maximization implies that
players will choose to do enough experimentation that steady states
correspond to Nash equilibria of the stage game.

To  motivate  these  results,  fix  an  extensive-form  game  of  perfect 
recall,  and  suppose  that  each  player  knows  its  structure.   Unless  he  has  a 
dominant  strategy,  this  knowledge  is  not  enough  for  a  player  to  decide  how 
to  play;  he  must  also  predict  the  play  of  his  opponents.   Nash  equilibrium 
and  its  refinements  describe  a  situation  where  each  player's  strategy  is  a 
best  response  to  his  beliefs  about  the  strategies  of  his  opponents,  and  each 
player's  predictions  are  exactly  correct.   To  understand  when  equilibrium 
analysis  is  justified  therefore  requires  an  explanation  of  how  players' 
predictions  are  formed,  and  when  they  are  likely  to  be  accurate. 

One  classic  explanation  is  that  an  outside  "mediator"  suggests  a 
strategy  profile  to  the  players,  who  accept  it  unless  some  player  could  gain 
by  deviating.   A  second  classic  explanation  is  that  the  game  is  dominance 
solvable  and  it  is  common  knowledge  that  players  are  Bayesian  rational,  so 
that  introspection  by  the  players  leads  them  all  to  the  same  predictions.   A 
more  recent  explanation,  introduced  by  Aumann  [1987]  and  further  developed 
by  Brandenburger  and  Dekel  [1987],  is  that  predictions  will  be  correct  if 
they  are  consistent  with  Bayesian  updating  from  a  common  prior  distribution. 

This  paper  contributes  to  the  development  of  a  fourth  explanation,  that 
Nash  equilibrium  is  the  result  of  learning  from  past  observations.   In  our 
model,  the  steady  state  distributions  of  strategies  played  approximate  those 
of  Nash  equilibria  if  players  live  a  long  time  and  are  also  sufficiently 
patient.   The  intuition  for  this  is  roughly  the  following. 

If a player's prior is not degenerate, he will learn a great deal about
the results of any action he chooses many times; this is "passive learning".
If the player is patient, he will choose to invest in discovering what his
best strategy is. This is accomplished by "active learning" or
"experimentation", meaning that the player will sometimes play actions that
do not maximize the current period's expected payoff given his current
beliefs, so that he may learn whether his beliefs are in fact correct.
Without experimentation, players can persistently maintain incorrect beliefs
about their opponents' off-path play, which is why steady states need not
correspond to equilibria unless players experiment sufficiently often.

Our  focus  on  active  learning  differentiates  the  paper  from  the 
literature  on  learning  in  rational  expectations  environments,  as  pioneered 
by  Bray  [1982].   In  that  literature,  players  observe  system-wide  aggregates, 
and  thus  have  no  reason  to  experiment.   The  focus  on  active  learning  also 
differentiates  the  paper  from  that  of  Canning  [1990],  which  in  other 
respects is quite similar. Canning studies two-player simultaneous-move
stage  games,  and  supposes  that  players  live  forever  but  only  remember  their 
last  T   observations.   He  shows  that  when  lifetimes  are  long,  the  steady 
states  approximate  Nash  equilibria.   We  should  also  mention  the  work  of 
Kalai and Lehrer [1991b], who study Bayesian learning in a setting
where  the  same  players  are  matched  with  one  another  in  every  period.   In 
their  model,  unlike  Canning's,  active  learning  is  possible,  but  active 
learning  is  not  necessary  for  their  main  result,  which  is  that  play 
eventually  resembles  that  of  a  Nash  equilibrium  of  the  repeated  game. 

Our  study  was  inspired  primarily  by  Fudenberg  and  Kreps  [1991],  who 
were  the  first  to  emphasize  the  importance  of  active  learning  in  justifying 
Nash  equilibrium.   They  showed  by  example  how  passive  learning  without 
experimentation  can  lead  to  steady  states  that  are  not  Nash.   There  are  two 
main differences between their work and ours. First, Fudenberg and Kreps
develop a model of bounded rationality, making ad-hoc assumptions about
players' behavior, while we assume that players are Bayesian expected-utility
maximizers. Second, Fudenberg and Kreps analyze the dynamic
evolution  of  a  system  where  all  players  become  more  experienced  over  time, 
and  characterize  the  states  to  which  the  system  can  converge,  while  we 
analyze  the  steady  states  of  a  stationary  system. 

The  steady  state  model  has  several  comparative  advantages.   Our  model 
can  describe  situations  where  the  players'  lifetimes  are  moderate  and  the 
inexperienced  players  have  a  substantial  influence,  although  we  do  not  study 
such  situations  here.   Our  model  is  mathematically  more  tractable,  which 
enables  us  to  solve  for  the  optimal  Bayesian  policies.   Finally,  in  the 
Fudenberg  and  Kreps  model,  players  act  as  if  they  are  facing  a  stationary 
environment  even  though  the  environment  is  not  stationary.   In  the  steady 
states  we  analyze,  the  players'  assumption  of  stationarity  is  justified. 

Both papers avoid the question of global stability, for which general
results seem unlikely. Fudenberg and Kreps do develop several notions of
local stability and establish global convergence for the class of 2x2
simultaneous-move stage games. Our model is rich enough to analyze local
stability, but we should point out that such an analysis would need to
consider the evolution of the system when the players' steady state
assumption is violated.

2.  The Stage Game

The stage game is an I+1-player extensive form game of perfect recall.
Player $i = I+1$ is nature. The game tree $X$, with nodes $x \in X$, is
finite. The terminal nodes are $z \in Z \subset X$. Information sets, denoted by
$h \in H$, form a partition of $X \setminus Z$. The information sets where player i has
the move are $H_i \subset H$, while $H_{-i} = H \setminus H_i$ are information sets for other
players (or nature). The feasible actions at information set $h \in H$ are
denoted $A(h)$; $A_i = \cup_{h \in H_i} A(h)$ is the set of all feasible actions for
player i, and $A_{-i} = \cup_{j \ne i} A_j$ are the feasible actions for player i's
opponents.

A pure strategy for player i, $s_i$, is a map from information sets in
$H_i$ to actions satisfying $s_i(h) \in A(h)$; $S_i$ is the set of all such
strategies. We let $s \in S \equiv \times_{i=1}^{I+1} S_i$ denote a pure strategy profile for all
players including nature, and $s_{-i} \in S_{-i} \equiv \times_{j \ne i} S_j$. Each strategy profile
determines a terminal node $\zeta(s) \in Z$. We suppose that all players know the
structure of the extensive form -- that is, the game tree $X$, information
partitions $H_i$ and action sets $A_i$. Hence, each player knows the space $S$
of strategy profiles, and can compute the function $\zeta$. Each player i
receives a payoff in the stage game that depends on the terminal node.
Player i's payoff function is denoted $u_i: Z \to R$. We let $U$ be the largest
difference in utility levels, $U \equiv \max_{i,z,z'} |u_i(z) - u_i(z')|$.

Let $\Delta(\cdot)$ denote the space of probability distributions over a set.
Then a mixed strategy profile is $\sigma \in \times_{i=1}^{I+1} \Delta(S_i)$. For ease of exposition,
we assume that nature plays a known mixed strategy $\sigma_{I+1}$. Our main result
(Theorem 5.1, about Nash equilibria) extends in a straightforward way to the
case where nature's move is drawn from a fixed but unknown distribution;
extending Theorem 6.1 requires a modification of the definition of
self-confirming equilibrium.

Let $Z(s_i)$ be the subset of terminal nodes that are reachable when $s_i$
is played, that is, $z \in Z(s_i)$ if and only if for some $s_{-i} \in S_{-i}$,
$z = \zeta(s_i, s_{-i})$. Similarly, define $X(s_i)$ to be all nodes that are
reachable under $s_i$, not merely terminal nodes. In a similar vein, let
$H(s_i)$ be the set of all information sets that can be reached if $s_i$ is
played. In other words, $h \in H(s_i)$ if there exists $x \in X(s_i)$ with
$x \in h$.

We will also need to refer to the information sets that are reached
with positive probability under $\sigma$, denoted $H(\sigma)$. Notice that if $\sigma_{-i}$ is
completely mixed, then $H(s_i, \sigma_{-i}) = H(s_i)$, as every information set that is
potentially reachable given $s_i$ has positive probability.

In addition to mixed strategies, we define behavior strategies. A
behavior strategy for player i, $\pi_i$, is a map from information sets in $H_i$
to probability distributions over moves, so that $\pi_i(h) \in \Delta(A(h))$; $\Pi_i$ is
the set of all such strategies. As with pure strategies, $\pi \in \Pi \equiv \times_{i=1}^{I+1} \Pi_i$,
and $\pi_{-i} \in \Pi_{-i} \equiv \times_{j \ne i} \Pi_j$. We also let $\pi_i(a)$ denote the component of
$\pi_i(h)$ corresponding to the action $a \in A(h)$. Finally, let $\hat\zeta(\pi) \in \Delta(Z)$ be
the probability distribution over terminal nodes induced by the behavior
strategy $\pi$.

Since the game has perfect recall, each mixed strategy $\sigma_i$ induces a
unique equivalent behavior strategy denoted $\pi_i(\cdot|\sigma_i)$. In other words,
$\pi_i(h|\sigma_i)$ is the probability distribution over actions at $h$ induced by
$\sigma_i$, and $\pi_i(a|\sigma_i)$ is the probability of the action $a \in A(h)$.

For each node $x$ and each player i, let $(a^\ell_{-i}(i,x))_{\ell=1}^{L}$ be the
collection of all actions of players other than i (including nature) which
are predecessors of $x$. (Note that $L$ is equal to the length of the path
to $x$ minus the number of nodes in the path that belong to player i.) If
pure strategy $s_i$ is such that $x \in X(s_i)$, player i believes that the
probability of reaching node $x$ when $s_i$ is played and the opponents'
strategies are $\pi_{-i}$ is

(2.1)       $p_i(x|\pi_{-i}) = \prod_{\ell=1}^{L} \pi_{-i}(a^\ell_{-i}(i,x))$.

Notice the convention we use: each node $x$ is assigned a number $p_i(x|\pi_{-i})$
which is the probability of reaching that node if any $s_i$ is played for
which $x \in X(s_i)$. Naturally if $x \notin X(s_i)$ the probability of reaching $x$
is zero, while $\sum_{z \in Z(s_i)} p_i(z|\pi_{-i}) = 1$. The effect of changing $s_i$ is not
on the numbers $p_i$, but rather the set of nodes $X(s_i)$ that can be
reached.

We now model the idea that a player has beliefs about his opponents'
play. Let $\mu_i$ be a probability measure over $\Pi_{-i}$, the set of other
players' behavior strategies. Fix $s_i$. Then the marginal probability of a
node $x \in X(s_i)$ is

(2.2)       $p_i(x|\mu_i) = \int p_i(x|\pi_{-i}) \mu_i(d\pi_{-i})$.

This in turn gives rise to preferences

(2.3)       $u_i(s_i, \mu_i) = u_i(s_i, p_i(\cdot|\mu_i)) = \sum_{z \in Z(s_i)} p_i(z|\mu_i) u_i(z)$.


It is important to note that even though the beliefs $\mu_i$ are over
opponents' behavior strategies, and thus reflect player i's knowledge that
his opponents choose their randomizations independently, the marginal
distribution $p(\cdot|\mu_i)$ over nodes can involve correlation between the opponents'
play. For example, if players 2 and 3 simultaneously choose between U and
D, player 1 might assign probability 1/4 to $\pi_2(U) = \pi_3(U) = 1$, and
probability 3/4 to $\pi_2(U) = \pi_3(U) = 1/2$. Even though both profiles in the
support of $\mu_1$ suppose independent randomization by players 2 and 3, the
marginal distribution on their joint actions is $p(U,U) = 7/16$ and
$p(U,D) = p(D,U) = p(D,D) = 3/16$, which is a correlated distribution. This
correlation reflects a situation where player 1 believes some unobserved
common factor has helped determine the play of both of his opponents. Since
the opponents are in fact randomizing independently, we should expect player
1's marginal distribution to reflect this if he obtains sufficiently many
observations, but until observations are accumulated, the correlation in
$p(\cdot|\mu_1)$ can persist.
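
The arithmetic in this example can be checked directly. The following is a minimal sketch, not part of the model itself; the names `belief` and `joint_marginal` are ours and are introduced only for illustration. It mixes the two independent profiles with weights 1/4 and 3/4 and compares the resulting joint distribution with the product of its own marginals.

```python
from fractions import Fraction
from itertools import product

# Player 1's belief over (pi_2(U), pi_3(U)); each entry is
# (weight, probability that player 2 plays U, probability that player 3 plays U).
# These numbers are the ones used in the example in the text.
belief = [
    (Fraction(1, 4), Fraction(1), Fraction(1)),
    (Fraction(3, 4), Fraction(1, 2), Fraction(1, 2)),
]

def joint_marginal(belief):
    """Average the product distributions over the belief, as in (2.2)."""
    marginal = {}
    for a2, a3 in product("UD", repeat=2):
        total = Fraction(0)
        for weight, p2_up, p3_up in belief:
            p2 = p2_up if a2 == "U" else 1 - p2_up
            p3 = p3_up if a3 == "U" else 1 - p3_up
            total += weight * p2 * p3  # independent play within each profile
        marginal[(a2, a3)] = total
    return marginal

marginal = joint_marginal(belief)

# Marginal probability that each opponent plays U.
p2_U = marginal[("U", "U")] + marginal[("U", "D")]
p3_U = marginal[("U", "U")] + marginal[("D", "U")]

# Under independence p(U,U) would equal p2_U * p3_U; here it does not.
is_correlated = marginal[("U", "U")] != p2_U * p3_U
```

Each opponent plays U with marginal probability 5/8, so independence would require p(U,U) = 25/64, whereas the mixture gives 7/16 = 28/64.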

Frequently $\mu_i$ will be either a point mass at $\pi_{-i}$, or have a
continuous density $g_i$ over $\Pi_{-i}$. In this case we write $p(x|\pi_{-i})$,
$u_i(s_i, \pi_{-i})$, $p(x|g_i)$ and $u_i(s_i, g_i)$ respectively.

3.  Steady States

Corresponding to each player (except nature) in the stage game is a
population consisting of a continuum of players in the dynamic game. In
each population, the total mass of players is one. There is a doubly
infinite sequence of periods, ...,-1,0,1,..., and each individual player
lives $T$ periods. Every period $1/T$ new players enter the ith population,
and we make the steady state assumption that there are $1/T$ players in each
generation, with $1/T$ players of age $T$ exiting each period.

Every period each player i is randomly and independently matched with
one player from each population $i' \ne i$, with the probability of meeting a
player i' of age $t$ equal to its population fraction $1/T$. For example,
if $T = 2$, each player is as likely to be matched with a "new" player as an
"old" one. Each player i's opponents are drawn independently.

Over his lifetime, each player observes the terminal nodes that are
reached in the games he has played, but does not observe the outcomes in
games played by others. Thus, each player will observe a sequence of
private histories. The private history of player i through time $t$ is
denoted $y_i = (s_i(1), z(1), \ldots, s_i(t), z(t))$. Let $Y_i$ denote the set of all
such histories of length no more than $T$. We let $t(y_i)$ denote the length
of a history $y_i \in Y_i$. New players have the null history 0, and we set
$t(0) = 0$.

A rule for a player of the ith kind is a map $r_i: Y_i \to S_i$ that
specifies player i's choice of pure strategy for each possible observation.
(Note that if $t(y_i) = T$, $r_i(y_i)$ has no meaning because player i does not
get to play at $T + 1$.)

Suppose for the moment that all players in population i use the same,
arbitrary, deterministic rule $r_i$, and face a sequence of opponents whose
play is a random draw from a stationary distribution. (This is in fact the
case in the steady states of our model.) We will soon specialize to the
case where the rules are derived from maximizing expected discounted values
given prior beliefs, but it is helpful to develop the mechanics of the
matching model before introducing that complication.

A steady state for given rules $r_i$ specifies fractions $\theta_i(y_i)$ of
each population i in each experience category $y_i$ such that after each
player meets one opponent at random and updates his experience accordingly,
the population fractions are unchanged. Specifically, if $\theta_i \in \Delta(Y_i)$, the
fraction of population i playing $s_i$ is

(3.1)       $\bar\theta_i(s_i) = \sum_{\{y_i : r_i(y_i) = s_i\}} \theta_i(y_i)$.

Also define $\bar\theta_{I+1} = \sigma_{I+1}$. We may then define a map
$f: \times_{i=1}^{I} \Delta(Y_i) \to \times_{i=1}^{I} \Delta(Y_i)$ by assigning $f[\theta]_i(y_i)$ to be the fraction of
player i's with experience $y_i$ after randomly meeting opponents drawn from
$\theta$. The new entrants to the population have no experience, so

(3.2)       $f[\theta]_i(0) = 1/T$.

The fraction having experience $(y_i, r_i(y_i), z)$ is the fraction of the
existing $\theta_i(y_i)$ that met opponents playing strategies that led to $z$. Noting
that $\zeta^{-1}[s_i, \cdot](z)$ are those strategies, we see that

(3.3)       $f[\theta]_i(y_i, r_i(y_i), z) = \theta_i(y_i) \sum_{s_{-i} \in \zeta^{-1}[r_i(y_i), \cdot](z)} \prod_{k \ne i} \bar\theta_k(s_k)$.

Finally, it is clear that

(3.4)       $f[\theta]_i(y_i, s_i, z) = 0$ if $s_i \ne r_i(y_i)$.

Definition 3.1: $\theta \in \times_{i=1}^{I} \Delta(Y_i)$ is a steady state if $\theta = f[\theta]$.

To illustrate this definition, consider the game "matching pennies",
with $S_1 = S_2 = \{H, T\}$. Suppose that $T = 2$ and

(3.5)       $r_1(0) = H$,   $r_1(H,H) = H$,   $r_1(H,T) = T$
            $r_2(0) = T$,   $r_2(H,T) = T$,   $r_2(T,T) = H$

(Note that we do not need to specify $r_1(T, \cdot)$ or $r_2(\cdot, H)$ as such
histories never occur -- young player 1's always play H and young player
2's always play T.)

In a steady state we have:

(3.6)       $\theta_1(0) = 1/2$,   $\theta_1(H,H) = \theta_1(0) \theta_2(T,T)$,   $\theta_1(H,T) = \theta_1(0) [\theta_2(0) + \theta_2(H,T)]$
            $\theta_2(0) = 1/2$,   $\theta_2(H,T) = \theta_2(0) [\theta_1(0) + \theta_1(H,H)]$,   $\theta_2(T,T) = \theta_2(0) \theta_1(H,T)$

a system of quadratic equations.

Computation shows that $\theta_1(0) = 1/2$, $\theta_1(H,H) = 1/10$, $\theta_1(H,T) = 4/10$,
$\theta_2(0) = 1/2$, $\theta_2(H,T) = 3/10$, and $\theta_2(T,T) = 2/10$, from which it follows
that $\bar\theta_1(H) = \theta_1(0) + \theta_1(H,H) = 6/10$ and $\bar\theta_2(H) = \theta_2(T,T) = 2/10$. Note
that the average play of the player 1's and the player 2's corresponds to a
mixed strategy of the stage game, even though all individuals in both
populations use deterministic rules. (Canning [1989], [1990] makes the same
point.)
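
The steady state of the matching pennies example can also be found numerically by applying the map f repeatedly. The sketch below is illustrative only: the dictionary keys "0", "HH", "HT" and "TT" are our shorthand for the histories 0, (H,H), (H,T) and (T,T), the starting point is arbitrary, and the fact that simple iteration converges is a feature of this particular example rather than a general property of steady states.

```python
def apply_f(theta1, theta2):
    """One application of the steady-state map for the T = 2 example."""
    # Fractions of each population playing H and T under the rules (3.5).
    bar1_H = theta1["0"] + theta1["HH"]  # young 1's, and 1's with history (H,H)
    bar1_T = theta1["HT"]                # 1's with history (H,T)
    bar2_H = theta2["TT"]                # 2's with history (T,T)
    bar2_T = theta2["0"] + theta2["HT"]  # young 2's, and 2's with history (H,T)

    # New entrants have mass 1/T = 1/2; last period's entrants (who played
    # H for population 1 and T for population 2) acquire a history that
    # depends on the opponent they happened to meet.
    new1 = {"0": 0.5, "HH": theta1["0"] * bar2_H, "HT": theta1["0"] * bar2_T}
    new2 = {"0": 0.5, "HT": theta2["0"] * bar1_H, "TT": theta2["0"] * bar1_T}
    return new1, new2

def steady_state(iterations=200):
    """Iterate the map from an arbitrary starting distribution."""
    theta1 = {"0": 0.5, "HH": 0.25, "HT": 0.25}
    theta2 = {"0": 0.5, "HT": 0.25, "TT": 0.25}
    for _ in range(iterations):
        theta1, theta2 = apply_f(theta1, theta2)
    return theta1, theta2

theta1, theta2 = steady_state()
share_1_playing_H = theta1["0"] + theta1["HH"]
share_2_playing_H = theta2["TT"]
```

The iteration settles on the fractions reported in the text, with 6/10 of the player 1's and 2/10 of the player 2's playing H.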

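Theorem 3.1: For any rules $r_i$ and nature's moves $\sigma_{I+1}$, a steady state
exists.

Proof: For given $\sigma_{I+1}$, $f$ is a polynomial map from $\times_{i=1}^{I} \Delta(Y_i)$ to
itself. A polynomial map is continuous and the domain is compact and convex,
so by Brouwer's fixed point theorem $f$ has a fixed point. ∎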

Given a steady state $\theta \in \times_{i=1}^{I} \Delta(Y_i)$, we may easily compute the
population fractions $\bar\theta_i \in \Delta(S_i)$ playing each strategy by (3.1).
Conversely, given the steady state fractions we can calculate the experience
levels recursively by

(3.7)       $\theta'_i(0) = 1/T$
            $\theta'_i(y_i, r_i(y_i), z) = \theta'_i(y_i) \sum_{s_{-i} \in \zeta^{-1}[r_i(y_i), \cdot](z)} \prod_{k \ne i} \bar\theta_k(s_k)$
            $\theta'_i(y_i, s_i, z) = 0$ if $s_i \ne r_i(y_i)$.

If we then recalculate $\bar\theta'_i$ using (3.1) we have a polynomial function $\bar f$
mapping $\times_{i=1}^{I} \Delta(S_i)$ to itself. We may equally well characterize a steady
state as a fixed point of $\bar f$, and calculate the corresponding fixed point
of $f$ using (3.7). Since $\Delta(S_i)$ is much smaller than $\Delta(Y_i)$, this is of
some practical importance.


4.  Value Maximization Given Stationary Beliefs

Our interest is not in arbitrary rules $r_i$ but in rules derived from
maximizing the expected discounted sum of per-period payoffs given exogenous
prior beliefs. More precisely, we suppose that each player's objective is
to maximize

(4.1)       $\frac{1-\delta}{1-\delta^T} E \sum_{t=1}^{T} \delta^{t-1} u_t$,

where $u_t$ is the realized stage game payoff at $t$ and $0 < \delta < 1$. Each
population believes that it faces a fixed (time invariant) probability
distribution of opponents' strategies, but is unsure what the true
distribution is.

Population i's prior beliefs are over behavior strategies. They are
given by a strictly positive density except over $\pi_{I+1}$, for which the prior
is a point mass at $\pi_{I+1}(\cdot|\sigma_{I+1})$. That is, the player knows the probability
distribution over nature's move. It is important to emphasize that player
i's beliefs about player j correspond to the average play of all player j's
and not the play of a particular individual. As in the matching pennies
example of the last section, a mixed distribution over player j's play may
be the result of different subpopulations of the player j's playing
different pure strategies.

For notational convenience, we suppose that all player i's begin the
game with the same prior beliefs $g_i$. All of our results extend immediately
to the case of finitely many subpopulations of player i's with different
initial beliefs.

We let $g_i(\cdot|z)$ denote the posterior beliefs starting with prior $g_i$
after $z$ is observed:

(4.2)       $g_i(\pi_{-i}|z) = p_i(z|\pi_{-i}) g_i(\pi_{-i}) / p_i(z|g_i)$.

Let $V_i^k(g_i)$ denote the maximized average discounted value (in current
units) starting at $g_i$ with $k$ periods remaining. Bellman's equation is

(4.3)       $V_i^k(g_i) = \max_{s_i \in S_i} [ (1-\phi_k) u_i(s_i, g_i) + \phi_k \sum_{z \in Z(s_i)} p_i(z|g_i) V_i^{k-1}(g_i(\cdot|z)) ]$

where $V_i^0(g_i) = 0$, and $\phi_k = \delta (1-\delta^{k-1}) / (1-\delta^k)$. Let $s_i^k(g_i)$ denote a
solution of this problem.


The optimal policy is $r_i(y_i) = s_i^{T-t(y_i)}(g_i(\cdot|y_i))$. Note that this
policy does not depend on the true value of the steady state. (The steady
state does influence the distribution of observations and hence the
distribution of actions played.) Thus by Theorem 3.1 a steady state exists.
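
The role of patience in this dynamic program can be seen in a toy problem that is much simpler than the model above and is not taken from it. Assume, purely for illustration, that a player chooses each period between a "safe" strategy with a known payoff and a "risky" strategy that pays 1 with an unknown probability and 0 otherwise, with a Beta(a, b) prior on that probability. The sketch below solves the finite-horizon problem by backward induction, weighting the current payoff by (1 - w) and the continuation value by w = δ(1 - δ^(k-1))/(1 - δ^k) when k periods remain, which is the normalization that keeps values in average per-period units. All function names are ours.

```python
from functools import lru_cache

def continuation_weight(k, delta):
    """Weight on the continuation value with k periods remaining."""
    return delta * (1 - delta ** (k - 1)) / (1 - delta ** k)

def make_solver(delta, safe):
    """Return q(k, a, b) -> (value of playing safe now, value of playing risky now).

    Beliefs about the risky strategy are Beta(a, b); k is the number of
    periods remaining. This two-strategy setup is an illustrative assumption.
    """
    @lru_cache(maxsize=None)
    def q_values(k, a, b):
        w = continuation_weight(k, delta)
        mean = a / (a + b)  # expected current payoff of the risky strategy
        # Playing safe yields no new information, so beliefs are unchanged.
        q_safe = (1 - w) * safe + w * value(k - 1, a, b)
        # Playing risky updates the Beta posterior by Bayes rule.
        q_risky = (1 - w) * mean + w * (
            mean * value(k - 1, a + 1, b) + (1 - mean) * value(k - 1, a, b + 1)
        )
        return q_safe, q_risky

    def value(k, a, b):
        # Maximized average discounted value; zero when no periods remain.
        return 0.0 if k == 0 else max(q_values(k, a, b))

    return q_values

# Uniform prior (mean 1/2), safe payoff 0.55, 30 periods remaining.
patient_safe, patient_risky = make_solver(0.95, 0.55)(30, 1, 1)
impatient_safe, impatient_risky = make_solver(0.10, 0.55)(30, 1, 1)
```

With these numbers the risky strategy has the lower current expected payoff (0.5 against 0.55), yet the patient player (δ = 0.95) strictly prefers to try it, while the impatient player (δ = 0.10) strictly prefers the safe strategy. This is the sense in which a patient player engages in active learning.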

These steady states are not very interesting if lifetimes are short.
For example if $T = 1$ the entire population consists of inexperienced
players, each of whom plays a best response to his prior beliefs and then
leaves the system. Our interest is in the nature of steady states in the
limit as lifetimes $T$ increase to infinity and $\delta$ goes to one.

For $h \in H_{-i}$, $a \in A(h)$, let $n(a|y_i)$ be the number of times the
move $a$ has been observed in the history $y_i$. We define $n(x|y_i)$ and
$n(h|y_i)$ similarly and set $n(s_i|y_i)$ to be the number of times player i has
played $s_i$.

Let $\hat\pi_{-i}(\cdot|y_i)$ be the sample average of player i's observations about
his opponents' play. That is, for each $h \in H_{-i}$ and $a \in A(h)$,

            $\hat\pi_{-i}(a|y_i) = n(a|y_i) / n(h|y_i)$

with the convention that $0/0 = 1$. Let $\hat p_i(z|y_i)$ be the distribution on
terminal nodes induced by the sample averages $\hat\pi_{-i}$, that is
$\hat p_i(z|y_i) = p_i(z|\hat\pi_{-i}(\cdot|y_i))$. Since $\hat p_i(\cdot|y_i)$ reflects the extensive form of
the game, it is not in general equal to the sample average on terminal nodes
$z \in Z(s_i)$. For example, consider a game where if player 1 moves $L$,
players 2 and 3 observe player 1's move and simultaneously choose $H_2$ or $T_2$
and $H_3$ or $T_3$ respectively. If the sample $y_1$ is 4 observations at
$(L, H_2, H_3)$, 1 observation each of $(L, H_2, T_3)$ and $(L, T_2, H_3)$, and 0
observations of $(L, T_2, T_3)$, then

            $\hat p_1((L, T_2, T_3)|y_1) = (1/6)(1/6) = 1/36$

even though there are no observations on $(L, T_2, T_3)$ in the sample. Since
player 1 is certain that players 2 and 3 randomize independently, he treats
the observed correlation in the sample as a fluke.
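
The distinction between the distribution induced by the sample averages and the raw sample frequency can be reproduced in a few lines. This is a minimal sketch for the three-player example above; the names `sample`, `induced_probability` and `sample_frequency` are ours, and the code assumes that both opponents' information sets were reached in every observation (so the 0/0 convention never arises).

```python
from collections import Counter
from fractions import Fraction

# Observed joint moves of players 2 and 3 after player 1 plays L.
sample = Counter({
    ("H2", "H3"): 4,
    ("H2", "T3"): 1,
    ("T2", "H3"): 1,
    ("T2", "T3"): 0,
})

def induced_probability(sample, a2, a3):
    """Probability of (a2, a3) implied by each opponent's own sample average."""
    n = sum(sample.values())
    n2 = sum(count for (x2, _), count in sample.items() if x2 == a2)
    n3 = sum(count for (_, x3), count in sample.items() if x3 == a3)
    # Player 1 knows the opponents randomize independently, so he multiplies.
    return Fraction(n2, n) * Fraction(n3, n)

def sample_frequency(sample, a2, a3):
    """Raw empirical frequency of the joint outcome (a2, a3)."""
    return Fraction(sample[(a2, a3)], sum(sample.values()))

induced = induced_probability(sample, "T2", "T3")
observed = sample_frequency(sample, "T2", "T3")
```

Here `induced` is 1/36 while `observed` is 0, which is the gap described in the text.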

Let $g_i(\cdot|y_i)$ be the posterior density over opponents' strategies
given sample $y_i$, and let $p_i(\cdot|y_i)$ be the corresponding distribution on
terminal nodes. It will often be convenient to abbreviate $V_i^k(g_i(\cdot|y_i))$ as
$V_i^k(y_i)$, $s_i^k(g_i(\cdot|y_i))$ as $s_i^k(y_i)$ and $u_i(s_i, g_i(\cdot|y_i))$ as $u_i(s_i|y_i)$.

5.  Active Learning and Nash Equilibrium

Our  goal  is  to  show  that  if  players  are  patient  as  well  as  long  lived 
then  steady  states  approximate  play  in  a  Nash  equilibrium.   Theorem  5.1 
establishes  this  for  the  case  where  lifetimes   T  go  to  infinity  "more 
quickly"  than  the  discount  factor  tends  to   1.   We  do  not  know  whether  the 
conclusion  of  the  theorem  holds  for  the  other  order  of  limits. 

Theorem 5.1: For any fixed priors $g_i$ there is a function $T(\delta)$ such that
if $\delta_m \to 1$ and $T_m \ge T(\delta_m)$, every sequence of steady states $\bar\theta^m$ has an
accumulation point $\bar\theta$, and every accumulation point is a Nash equilibrium.

An accumulation point exists by compactness; the interesting part of
the theorem is that the accumulation points are Nash equilibria. The idea
of the proof is simply that players do enough experimentation to learn the
true best responses to the steady state. The obvious argument is that if a
player is very patient, and a strategy has some probability of being the
best response, the player ought to try it and see. However, the simple
example in Figure 1 shows a complication. Even a very patient player may
optimally choose to never experiment with certain strategies. Moreover,
these unused strategies need not be dominated strategies in the stage game.
In the game of Figure 1, if player 1 assigns a low probability to player 2
playing $L_2$, his current period's expected payoff is maximized by playing
$L_1$. Now, if player 1 is patient, he may be willing to incur a short-run
cost to obtain information about player 2's play, but given player 1's
beliefs, the lowest-cost way of obtaining this information is by playing
$R_1$, and player 1 may never play $M_1$. Since not all experiments need be
made, our proof will use a more indirect approach.

Very briefly, our proof derives both upper bounds and lower bounds on
the players' option values, that is, the difference between their expected
payoff in the current round and their expected average present value. We
argue that if $s_i$ has positive probability in the limit, most players using
$s_i$ do not expect to learn much more about its consequences, and play $s_i$
because it maximizes their current payoff. For these players the option
value of the game is low. However, we also show that if $s_i$ is not a best
response to the steady state distribution and players are patient, then they
are very unlikely to observe a sample that makes their option value small,
thus obtaining a contradiction. Intuitively, if some strategy $s_i'$ is a
better response than $s_i$ to the steady state distribution $\bar\theta_{-i}$, and player
i's beliefs assign non-negligible probability to the opponents' strategies
lying in a neighborhood of $\bar\theta_{-i}$, then player i should have a positive
option value from the possibility of playing $s_i'$, and this option value
will be large if player i is sufficiently patient. Moreover, since player
i's prior assigns positive probability to all neighborhoods of $\bar\theta_{-i}$,
repeated observations from the distribution generated by $\bar\theta_{-i}$ should be
unlikely to substantially reduce the probability player i assigns to
neighborhoods of $\bar\theta_{-i}$.

[Figure 1]

The proof of Theorem 5.1 uses five lemmas proven in Appendix B. Before
proving the theorem we first discuss the lemmas. Recall that $s_i^k(y_i)$ is
the optimal choice of strategy for a player with $k$ periods of play
remaining and beliefs $g_i(\cdot|y_i)$.

Lemma  5.2:   There  exists  a  function  r\{u)   ■*   0   such  that  for  all  y.   and  S 

I  k    I 

(5.2.1)      max   u.(s.y.)  -  u.(s.(y.)y.) 

L 

-'^i^^i^  -  u.(s^(y.)|y.) 

<  [6/(1-5)1  "'^^xex(s^(y.))  p('^lyi>''("('^iyi))- 

The first inequality follows from the fact that, because it is possible to
achieve average payoff $\max_{s_i} u_i(s_i|y_i)$ by ignoring all subsequent
observations and playing the strategy that maximizes $u_i(s_i|y_i)$ for the
remainder of the individual's lifetime, $V_i^k(y_i) \ge \max_{s_i} u_i(s_i|y_i)$.

To understand the second inequality, observe that
$V_i^k(y_i) - u_i(s_i^k(y_i)|y_i)$ is a measure of the option value, or anticipated
capital gain, to the information generated by playing $s_i^k(y_i)$. The second
inequality asserts that as the total number of observations on nodes that
are reachable under $s_i^k(y_i)$ becomes large, when weighted by the empirical
probability $\hat p_i$ that these nodes are reached, the option value becomes
small. The idea behind this conclusion is that once a player has many
observations of play at a given information set, another observation will
not change his beliefs about that information set very much, as has been
shown by Diaconis and Freedman [1989]. (This relies on the assumption of
non-doctrinaire priors.) In the context of a simple multi-armed bandit
model, their result implies that the option value of each arm is bounded by
a decreasing function of the number of times each arm has been played. The
reason that the third expression in (5.2.1) is more complicated than this is
that in our model players know the extensive form of the game. This means
that they may know that some large samples are not "representative", and for
these large samples the option value may still be large.
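The claim that one more observation barely moves beliefs once the sample is large can be illustrated in the simplest possible case. The sketch below is not the Diaconis and Freedman result itself; it assumes a single binary action and a uniform Beta(1, 1) prior (both are illustrative assumptions, as are the function names), and computes the largest change in the posterior mean that one additional observation can cause, maximized over every sample of a given size.

```python
from fractions import Fraction


def posterior_mean(successes, trials):
    """Posterior mean of a Bernoulli parameter under a uniform Beta(1, 1) prior."""
    return Fraction(successes + 1, trials + 2)


def max_shift(trials):
    """Largest change in the posterior mean caused by one more observation,
    maximized over all samples with `trials` observations."""
    worst = Fraction(0)
    for successes in range(trials + 1):
        before = posterior_mean(successes, trials)
        for outcome in (0, 1):
            after = posterior_mean(successes + outcome, trials + 1)
            worst = max(worst, abs(after - before))
    return worst


# The bound depends on the sample size only, not on the particular sample.
shifts = {n: max_shift(n) for n in (1, 10, 100, 1000)}
for n, shift in shifts.items():
    print(n, float(shift))
```

In this special case the worst-case shift is below 1/(n+3) for every sample of size n, which is the kind of sample-independent rate that the argument relies on.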

As an example of this possibility, consider the game in Figure 2. In this
game, if player 1 plays D, then players 2 and 3 simultaneously choose
between L and R; player 4 only gets to move if players 2 and 3 both play
R. Now suppose that player 1 has played D 200 times, and that he has
observed 100 draws of (D,L,R), and 100 draws of (D,R,L). Then if his
prior beliefs on $\pi_2$ and $\pi_3$ are uniform, his posteriors will be
concentrated in the neighborhood of players 2 and 3 each playing the strategy
(1/2 L, 1/2 R). Given these beliefs, player 1 believes there is a
substantial (about 1/4) probability that playing D will lead to player 4's
information set, so that information about player 4's play is potentially
valuable, yet player 1 has not received any observations of player 4's play.

The point is that because player 1 knows that players 2 and 3 play
simultaneously, he treats the observed correlation in their play as a fluke.^9
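The arithmetic of this example can be checked directly. The sketch below assumes independent uniform Beta(1, 1) priors on the probability that each of players 2 and 3 plays R (the variable and function names are illustrative). Because the priors are independent and the two players move simultaneously, the posteriors remain independent, so the predicted probability of reaching player 4's information set is the product of the two posterior means even though (R,R) never occurs in the sample.

```python
from fractions import Fraction

# 200 observations of D: 100 draws of (L, R) and 100 draws of (R, L)
# for players 2 and 3 respectively.
observations = [("L", "R")] * 100 + [("R", "L")] * 100


def posterior_mean_R(actions):
    """Posterior mean probability of R under a uniform Beta(1, 1) prior."""
    count_R = sum(1 for a in actions if a == "R")
    return Fraction(count_R + 1, len(actions) + 2)


p2_R = posterior_mean_R([a2 for a2, _ in observations])
p3_R = posterior_mean_R([a3 for _, a3 in observations])

# Player 1's predicted probability that both play R (player 4 gets to move).
prob_reach_4 = p2_R * p3_R

# Empirical frequency of (R, R) in the sample.
empirical_RR = Fraction(
    sum(1 for obs in observations if obs == ("R", "R")), len(observations)
)

print(prob_reach_4, empirical_RR)
```

The gap between the predicted probability and the empirical frequency is what makes this large sample "unrepresentative" from player 1's point of view.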

Of  course,  if  players  2  and  3  do  choose  their  strategies  independently, 
the  sort  of  spurious  correlation  in  this  example  should  be  unlikely,  so  that 
most  large  samples  should  lead  to  small  option  values.   This  is  established 
by  Corollary  5.5  below. 


Figure  2 


The final important remark about Lemma 5.2 is that it places weaker and
weaker bounds on the option value as $\delta$ goes to 1, as a very patient
player will have gains from learning when faced with even a small amount of
uncertainty. This property is what has prevented us from extending our
proof of Theorem 5.1 to limits where $\delta$ goes to 1 faster than T goes to
infinity.

Lemma  5.2  provides  an  upper  bound  on  expected  gains  from  learning  about 
the  consequences  of  a  strategy;  the  next  lemma  provides  a  lower  bound. 

Define

$P(s_i,\Delta,y_i) = \max_{s_i'} \int_{\{\pi_{-i} \,|\, u_i(s_i',\pi_{-i}) \ge u_i(s_i,\pi_{-i})+\Delta\}} g_i(\pi_{-i}|y_i)\,d\pi_{-i}$

to be the largest posterior probability (given the sample $y_i$) that a
strategy yields a gain of $\Delta$ over $s_i$. If this probability is large and
the player is patient, the option value ought to be large.

Lemma 5.3: For all $\epsilon > 0$ and $\Delta > 0$, there are $\underline\delta < 1$ and K such that
for all $\delta \in [\underline\delta,1)$ and for all $k \ge K$

(5.3.1)   $\Delta \cdot P(s_i^k(y_i),\Delta,y_i) - \epsilon \le \dfrac{V_i^k(y_i) - u_i(s_i^k(y_i)|y_i)}{1-\epsilon}.$

Now we turn to the issue of what fraction of the population has sampled
frequently enough to have discovered (approximately) the true distribution
of opponents' play. Stating the desired result requires some additional
notation. Fix a horizon $T_m$, a plan $r_i$ for the $i$th kind of player, and
the population fraction of opponents' actions $\theta_{-i}$. From (3.7) we can
calculate the corresponding steady state fractions $\bar\theta_i$ for the population
i. Let $\bar Y_i \subset Y_i$ be a subset of the histories for player i. We define
$\bar\theta_i^m(\bar Y_i)$ to be the steady state fraction of population i in category
$\bar Y_i$. First we examine the relationship between the probabilities of nodes as
measured by maximum likelihood, and the true population value.

Lemma 5.4: For all $\epsilon > 0$ and functions $\eta$ such that $\eta(n) \to 0$
as $n \to \infty$ there is an N such that for all $T_m$, $\theta_{-i}$, $r_i$, and $s_i$

(5.4.1)   $\bar\theta_i^m\big(\max_{x \in X(s_i)} \hat p_i(x|y_i)\,\eta(n(x|y_i)) > \epsilon, \text{ and } n(s_i|y_i) \ge N\big) \le \epsilon.$

This asserts that few people have a large sample on the strategy $s_i$, few
hits on a reachable node, and a high maximum likelihood estimate of hitting
it. Notice the nature of the assertion: the fraction of the population
that has both a large sample and an unrepresentative one is small. It need
not be true that of those that have a large sample most have a
representative sample. Since the sampling rule is endogenous, we must allow
the possibility that sampling continues only if the sample is unusual (e.g.,
keep flipping until tails have been observed).
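The coin-flipping remark can be made concrete. The sketch below assumes a fair coin and the stopping rule "flip until the first tail" (function names are illustrative). Every sample of size n then consists of n-1 heads followed by one tail, so every large sample is unrepresentative; what the lemma needs is only that the share of such samples in the population is small.

```python
from fractions import Fraction


def share_with_at_least(n_flips):
    """Probability that 'flip a fair coin until the first tail' yields a
    sample of at least n_flips flips (the first n_flips - 1 must be heads)."""
    return Fraction(1, 2) ** (n_flips - 1)


def heads_frequency(sample_size):
    """Empirical frequency of heads in a completed sample of this size:
    sample_size - 1 heads followed by the single tail that stops sampling."""
    return Fraction(sample_size - 1, sample_size)


for n in (2, 11, 21):
    print(n, float(share_with_at_least(n)), float(heads_frequency(n)))
```

Conditional on a large sample the estimate of the heads probability is close to 1 rather than 1/2, but the joint event "large and unrepresentative sample" has probability that vanishes geometrically in the sample size.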

Our next step is to combine Lemmas 5.2 and 5.4 to conclude that players
are unlikely to repeatedly play a strategy solely for its option value.
Intuitively, if strategy $s_i$ has already been played many times, the
player's observations $y_i$ should have provided enough observations at the
relevant nodes that the player is unlikely to learn very much by playing $s_i$
again. As we explained earlier, the reason this conclusion only holds for
most large samples, as opposed to all of them, is that since players know
the extensive form of the game, they may know that their sample is
"unrepresentative" of the true distribution.


Corollary 5.5: For all $\epsilon > 0$ there exists N such that for all $T_m$ and
all $\theta_{-i}$

$\bar\theta_i^m\big(V_i^k(y_i) - u_i(s_i^k(y_i)|y_i) > \delta\epsilon/(1-\delta),\ n(r_i^m(y_i)|y_i) \ge N\big) \le \epsilon,$

where $k = T_m - t(y_i)$ is the number of periods remaining given history $y_i$.

Proof of Corollary: Substitute the second inequality of (5.2.1) into
inequality (5.4.1). ∎

This corollary shows that even very patient players will eventually exhaust
the option value of a strategy they have played many times; of course, the
N required to satisfy (5.5) may grow large as $\delta \to 1$.

Our next lemma asserts that regardless of sample size, players are
unlikely to become convinced of "wrong" beliefs. Given $h \in H_{-i}$ and
$a \in A(h)$, we can calculate $\pi_{-i}(a|\theta_{-i})$ to be the conditional probability
that a is chosen, given that h is reached, and $p_i(x|\theta_{-i})$ to be the
probability of reaching node x. Define

$B_\epsilon(\bar\theta^m_{-i}) = \{\pi_{-i} : \|p_i(z|\bar\theta^m_{-i}) - p_i(z|\pi_{-i})\| < \epsilon \text{ for all } z \in Z\}$

to be the beliefs $\pi_{-i}$ that yield approximately the same distribution over
terminal nodes as $\bar\theta^m_{-i}$. Let

$Q_\epsilon(\bar\theta^m_{-i}|y_i) = \int_{B_\epsilon(\bar\theta^m_{-i})} g_i(\pi_{-i}|y_i)\,d\pi_{-i}$

be the corresponding posterior probability, and let $Q_\epsilon(\bar\theta^m_{-i}|0)$ be the prior
probability. The result of Diaconis and Freedman mentioned earlier implies
that along any sample path, the posterior beliefs converge to a point mass
on the empirical distribution; the strong law of large numbers implies that
players are unlikely to have a sample that both reaches an information set
many times and gives misleading (that is, biased) information about play
there. These two facts underlie the following lemma. Note well the order of
quantifiers here: a single $\gamma$ can be used for all samples $y_i$, steady
states $\bar\theta^m$, and lifetimes $T_m$.

Lemma 5.6: For all $\epsilon$ there exists a $\gamma$ such that for all $y_i$, $\theta_{-i}$, $r_i$
and $T_m$

$\bar\theta_i^m\{Q_\epsilon(\bar\theta^m_{-i}|y_i)/Q_\epsilon(\bar\theta^m_{-i}|0) < \gamma\} \le \epsilon.$

Our  last  lemma  asserts  that  if  the  population  fraction  playing  a 
strategy  is  positive,  the  population  fraction  that  has  played  it  a  number  of 
times  must  be  sizeable  as  well. 

Lemma 5.7: Let $T_m \to \infty$ be the length of life, and $\bar\theta^m$ be a subsequence of
steady states that converge to $\bar\theta$, and let $r_i^m$ be the corresponding rules.
Then

(5.7.1)   $\bar\theta_i^m\big(n(s_i|y_i) \ge N \text{ and } r_i^m(y_i) = s_i\big) \ge \bar\theta_i^m(s_i) - (N/T_m).$
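The counting idea behind a bound of this form can be sketched numerically. The code below is an illustration under simplifying assumptions, not the proof in Appendix B: each lifetime is a fixed sequence of T strategy choices (here drawn at random rather than generated by an optimal rule), and since generations are of equal size, the share of the population in a given situation equals the share of periods of a lifetime spent in it. Along any path, at most N of the periods in which the strategy is played can have fewer than N earlier plays of it.

```python
import random


def fractions_for_path(path, strategy, N):
    """Share of periods in which `strategy` is played, and share in which it
    is played after having already been played at least N times."""
    T = len(path)
    plays_so_far = 0
    playing = 0
    experienced = 0
    for choice in path:
        if choice == strategy:
            playing += 1
            if plays_so_far >= N:
                experienced += 1
            plays_so_far += 1
    return playing / T, experienced / T


rng = random.Random(0)
T, N = 200, 15
results = [
    fractions_for_path([rng.choice("abc") for _ in range(T)], "a", N)
    for _ in range(500)
]

# Path by path: experienced share >= playing share - N / T.
bound_holds = all(exp >= play - N / T - 1e-12 for play, exp in results)
print(bound_holds)
```

Averaging the path-by-path inequality over the distribution of paths gives an inequality of the same shape as (5.7.1).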

With these lemmas in hand, we can now prove Theorem 5.1: Any limit
point $\bar\theta$ of steady states is a Nash equilibrium. Here is a rough sketch of
the proof: If $\bar\theta$ is not a Nash equilibrium, then some player i must be
able to gain at least $3\Delta > 0$ by not playing some strategy $s_i$ that $\bar\theta$
assigns positive probability. If player i's beliefs assign probability
close to 1 to a neighborhood of $\bar\theta$, he would assign a non-negligible
probability to the event that his opponents' strategies are such that he
could gain at least $\Delta$ by deviating; call this event E. Now player i's
prior beliefs are non-doctrinaire, and hence assign a non-zero probability to
E, and from Lemma 5.6, for most samples player i's posterior does not
assign E a vanishingly small weight. For all such samples, Lemma 5.3
implies that player i's expected gain from learning is not negligible
provided his discount factor is sufficiently high and he has sufficiently
many periods of life remaining.

However, since $\bar\theta_i(s_i)$ is positive, Lemma 5.7 implies that a
non-negligible fraction of the player i's have played $s_i$ many times and intend
to play it again. From Corollary 5.5, if $\delta$ is large, most of these
players must have negligible expected gains from learning about $s_i$, which
contradicts the conclusion of the last paragraph that player i is unlikely
to have samples that give him a negligible gain from learning.

Proof of Theorem 5.1: We will first show that for each $\Delta > 0$ there is a
function $T(\delta,\Delta)$ such that if $\delta_m \to 1$ and $T_m \ge T(\delta_m,\Delta)$ any accumulation
point $\bar\theta$ of the steady states $\bar\theta^m$ has the property that no player can gain
more than $3\Delta$ by deviating from $\bar\theta$. That is, for all players i, all
$s_i \in \mathrm{support}(\bar\theta_i)$ and all $s_i'$, $u_i(s_i,\bar\theta_{-i}) \ge u_i(s_i',\bar\theta_{-i}) - 3\Delta$. The existence of the
desired function $T(\delta)$ will follow from a diagonalization argument.

Thus, we fix a $\Delta > 0$ and a sequence of positive numbers $\epsilon_m \to 0$, and
let $k(\epsilon_m,\Delta)$ satisfy the conditions of Lemma 5.3 for $\epsilon = \epsilon_m$ and $\Delta$. Let
$N(\delta)$ satisfy the conditions of Corollary 5.5 for $\delta$ and $\epsilon' = (1-\delta)^2/\delta$,
so that

(5.1.1)   $\bar\theta_i^m\big(V_i^k(y_i) - u_i(s_i^k(y_i)|y_i) > (1-\delta),\ n(r_i^m(y_i)|y_i) \ge N(\delta)\big) \le (1-\delta)^2/\delta.$

Finally, choose $T(\epsilon_m,\delta,\Delta) = [k(\epsilon_m,\Delta) + N(\delta)]/(1-\delta)$.
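The role of these choices can be checked with exact arithmetic. The sketch below uses illustrative values for k, N, and the discount factor (these numbers are assumptions, not taken from the paper) and verifies three facts used later in the proof: the threshold in Corollary 5.5 with $\epsilon' = (1-\delta)^2/\delta$ equals $1-\delta$; $N/T \le 1-\delta$; and $1 - k/T \ge \delta$.

```python
from fractions import Fraction


def horizon(k, N, delta):
    """T = [k + N] / (1 - delta), the lifetime chosen in the proof."""
    return Fraction(k + N) / (1 - delta)


# Illustrative values (assumptions for this check only).
delta = Fraction(99, 100)
k, N = 40, 500

T = horizon(k, N, delta)
eps_prime = (1 - delta) ** 2 / delta

# Threshold delta * eps' / (1 - delta) from Corollary 5.5.
threshold = delta * eps_prime / (1 - delta)

share_inexperienced = Fraction(N) / T      # N / T
share_with_k_left = 1 - Fraction(k) / T    # 1 - k / T

print(T, threshold, share_inexperienced, share_with_k_left)
```

For any lifetime longer than this T the two shares only move in the favorable direction, which is why the proof requires $T_m \ge T(\epsilon_m,\delta_m,\Delta)$.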

Extracting a subsequence if necessary, we suppose that $\bar\theta$ is a limit
of steady states $\bar\theta^m$ for $(\delta_m,T_m)$, with $\delta_m \to 1$ and $T_m \ge T(\epsilon_m,\delta_m,\Delta)$. We
claim that no player can gain more than $3\Delta$ by not playing some
$s_i \in \mathrm{support}(\bar\theta_i)$. If this claim is false, then for m sufficiently large
there is an $s_i'$ such that $u_i(s_i',\bar\theta^m_{-i}) > u_i(s_i,\bar\theta^m_{-i}) + 2\Delta$. Since
$\bar\theta_i^m(s_i) > \bar\theta_i(s_i)/2$ for sufficiently large m, it follows that for all
sufficiently large m and all sufficiently small $\epsilon > 0$, any profile for
i's opponents that is within $\epsilon$ of $\bar\theta^m_{-i}$ gives a gain of at least $\Delta$ from
playing $s_i'$ instead of $s_i$. Thus for any sample $y_i$, the maximized
probability $P(s_i,\Delta,y_i)$ that some deviation from $s_i$ yields a gain of at
least $\Delta$ is at least the posterior probability $Q_\epsilon(\bar\theta^m_{-i}|y_i)$ that player i
assigns to profiles in the set $B_\epsilon(\bar\theta^m_{-i})$. From Lemma 5.6, there is a $\gamma > 0$
such that for all $T_m$ and $\bar\theta^m$,

$\bar\theta_i^m\{Q_\epsilon(\bar\theta^m_{-i}|y_i)/Q_\epsilon(\bar\theta^m_{-i}|0) < \gamma\} \le \bar\theta_i(s_i)/4.$

This shows that not too many player i's can have observed samples that have
caused them to substantially lower the probability they assign to an
$\epsilon$-neighborhood of the true steady state. Moreover, $Q_\epsilon(\bar\theta^m_{-i}|0) \ge \underline{Q} > 0$ since
the prior is uniformly bounded away from zero by our assumption of
non-doctrinaire priors. Using this fact and our earlier observation that
$P(s_i,\Delta,y_i) \ge Q_\epsilon(\bar\theta^m_{-i}|y_i)$, we have

(5.1.2)   $\bar\theta_i^m\big(P(s_i,\Delta,y_i) < \gamma\underline{Q}\big) \le \bar\theta_i(s_i)/4$, so that

$\bar\theta_i^m\big(P(s_i,\Delta,y_i) \ge \gamma\underline{Q}\big) \ge 1 - \bar\theta_i(s_i)/4.$

Inequality (5.1.2) gives us a lower bound on the fraction of player i's
who assign a non-negligible probability to any strategy yielding a $\Delta$
improvement over $s_i$. Our next steps are to argue that (i) since $s_i$ has
positive probability in the limit, there must be a non-negligible proportion
of the population that has played $s_i$ many times, intends to play it again,
and assigns a non-negligible probability to any strategy improving on $s_i$,
and (ii) there must therefore be a non-negligible proportion of the
players who play $s_i$ even though they have played it many times before and
have a non-negligible expected gain from learning. This last conclusion
will then be shown to contradict (5.1.1).

To carry out this program, use Lemma 5.7 and the facts that
$\bar\theta_i^m(s_i) > \bar\theta_i(s_i)/2$ and $N(\delta_m)/T_m \le N(\delta_m)/T(\epsilon_m,\delta_m,\Delta) \le 1 - \delta_m$ to conclude

(5.1.3)   $\bar\theta_i^m\big(n(s_i|y_i) \ge N(\delta_m),\ r_i^m(y_i) = s_i\big) \ge \bar\theta_i(s_i)/2 - (1-\delta_m).$

From DeMorgan's law, the probability of the intersection of the events in
(5.1.2) and (5.1.3) is at least the sum of the individual probabilities
minus 1, so that

(5.1.4)   $\bar\theta_i^m\big(P(s_i,\Delta,y_i) \ge \gamma\underline{Q},\ n(s_i|y_i) \ge N(\delta_m),\ r_i^m(y_i) = s_i\big) \ge \bar\theta_i(s_i)/4 - (1-\delta_m).$
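The step from (5.1.2) and (5.1.3) to (5.1.4) uses the elementary bound that the probability of an intersection is at least the sum of the two probabilities minus one. The sketch below checks that bound by brute force on a small uniform sample space and then reproduces the arithmetic of (5.1.4) for illustrative values of the limit share and the discount factor (both values are assumptions chosen for the check).

```python
from fractions import Fraction
from itertools import combinations


def intersection_lower_bound(p_a, p_b):
    """Lower bound P(A and B) >= P(A) + P(B) - 1."""
    return p_a + p_b - 1


# Brute-force check on a uniform six-point space.
points = range(6)
events = [set(c) for r in range(7) for c in combinations(points, r)]
bound_always_holds = all(
    Fraction(len(A & B), 6)
    >= intersection_lower_bound(Fraction(len(A), 6), Fraction(len(B), 6))
    for A in events
    for B in events
)

# Arithmetic of (5.1.4) with illustrative values.
theta = Fraction(2, 5)      # limit share playing s_i (assumed)
delta = Fraction(99, 100)   # discount factor (assumed)
p_512 = 1 - theta / 4               # lower bound from (5.1.2)
p_513 = theta / 2 - (1 - delta)     # lower bound from (5.1.3)
bound_514 = intersection_lower_bound(p_512, p_513)

print(bound_always_holds, bound_514)
```

The same bound is applied once more to obtain (5.1.5), which is where the additional $(1-\delta_m)$ term comes from.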

By construction, $1 - k(\epsilon_m,\Delta)/T_m \ge 1 - k(\epsilon_m,\Delta)/T(\epsilon_m,\delta_m,\Delta) \ge \delta_m$, i.e., at
least a $\delta_m$ fraction of the player i's have at least k periods of life
remaining (since each generation is of the same size). From Lemma 5.3, when
$\delta_m$ is sufficiently large, the expected gains to learning of all players who
have at least k periods of life are bounded below by a function of $\Delta \cdot P$;
substituting this function into (5.1.4) and using DeMorgan's law again yields

(5.1.5)   $\bar\theta_i^m\big([V_i^k(y_i) - u_i(s_i^k(y_i)|y_i) \ge (\Delta\gamma\underline{Q} - \epsilon_m)/(1-\epsilon_m)],\ n(s_i|y_i) \ge N(\delta_m),\ r_i^m(y_i) = s_i\big) \ge \bar\theta_i(s_i)/4 - 2(1-\delta_m).$

If we now choose m large enough that $(1-\delta_m) < (\Delta\gamma\underline{Q} - \epsilon_m)/(1-\epsilon_m)$ and
$\bar\theta_i(s_i)/4 - 2(1-\delta_m) > (1-\delta_m)^2/\delta_m$, we conclude that

$\bar\theta_i^m\big([V_i^k(y_i) - u_i(s_i^k(y_i)|y_i) > (1-\delta_m)],\ n(s_i|y_i) \ge N(\delta_m),\ r_i^m(y_i) = s_i\big) > (1-\delta_m)^2/\delta_m,$

which contradicts (5.1.1) because the event in this display implies the
event whose probability is bounded in (5.1.1). ∎

6. Conclusion

We conclude by examining the scope of Theorem 5.1. Can every Nash
equilibrium of every finite game be realized as a limit of steady states as
$T \to \infty$ and $\delta \to 1$? Is it really necessary for $\delta \to 1$ to achieve Nash
equilibrium in the limit as $T \to \infty$?

Concerning the issue of which Nash equilibria can be obtained as limits
of steady states, we do not have a complete answer. It is easy to see that
limits of steady states can be mixed as well as pure strategy equilibria.
Since any limit point must be a Nash equilibrium, this must be the case in
any game that does not have a pure strategy equilibrium. However, whether
as $T \to \infty$ and $\delta \to 1$ it is actually possible to attain some refinement of
Nash equilibrium must await further research.

If T is not large, players do not have much data, and play may be
quite arbitrary and heavily influenced by priors. What if $T \to \infty$, but $\delta$
is not close to one? In this case players will have a great deal of
information about those strategies they have chosen to play, but players may
not have much incentive to invest in exploring many strategies.
Consequently, play may fail to be Nash because untested beliefs about opponents' play
off the equilibrium path are not correct.

To see that for $\delta \not\to 1$ a sequence of steady states may fail to
converge to Nash equilibrium, we consider an example due to Fudenberg and
Kreps [1991]. Consider the 3-person game shown in Figure 3. Suppose that
the prior of player 1 is that player 3 will play R with very high
probability (> 2/3), while that of player 2 is that 3 will play L with
very high probability. If $\delta = 0$, or is very small, consider a candidate
for a steady state in which all player 1's always play $A_1$ and all player
2's always play $A_2$. This is optimal in the first period of life given the
priors and low discount factor, and as a result no information about player
3 is gained, and the proposed play is again optimal in the second period of
life and so forth. Consequently, regardless of T this constitutes a
steady state. On the other hand, in any Nash equilibrium either 1 must play
$D_1$ or 2 must play $D_2$.

Figure 3 (terminal payoffs shown: (1,1,1); (3,0,0), (0,3,0); (3,0,0), (0,3,0))
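The logic of this example can be checked numerically under an assumed reading of Figure 3, since the game tree itself is not reproduced here. The sketch assumes that (A_1, A_2) yields (1,1,1), that a deviation by either player 1 or player 2 leads to a single information set of player 3, and that L then yields (3,0,0) while R yields (0,3,0). The specific belief values 1/4 and 3/4 are illustrative choices consistent with "very high probability (> 2/3)".

```python
from fractions import Fraction

PAYOFF_FROM_A = 1  # payoff to players 1 and 2 when both play A (assumed tree)


def deviation_payoff_1(prob_L):
    """Player 1's expected payoff from D_1: he gets 3 only if player 3 plays L."""
    return 3 * prob_L


def deviation_payoff_2(prob_L):
    """Player 2's expected payoff from D_2: he gets 3 only if player 3 plays R."""
    return 3 * (1 - prob_L)


# Divergent, untested beliefs about player 3's probability of L.
belief_1 = Fraction(1, 4)   # player 1 expects R with probability 3/4
belief_2 = Fraction(3, 4)   # player 2 expects L with probability 3/4
both_stay_with_A = (
    deviation_payoff_1(belief_1) < PAYOFF_FROM_A
    and deviation_payoff_2(belief_2) < PAYOFF_FROM_A
)

# With a common, correct belief about player 3, someone always gains by deviating.
grid = [Fraction(j, 100) for j in range(101)]
someone_deviates = all(
    deviation_payoff_1(q) > PAYOFF_FROM_A or deviation_payoff_2(q) > PAYOFF_FROM_A
    for q in grid
)

print(both_stay_with_A, someone_deviates)
```

Under this reading, the outcome (A_1, A_2) survives only because the two players hold different beliefs about an information set that is never reached, which is exactly what a low discount factor leaves untested.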

We conclude with a brief examination of the consequences of the fact
that as $T \to \infty$ players must have a great deal of information about those
strategies they have chosen to play. Formally, let $H(\sigma)$ be the
information sets reached with positive probability when the mixed strategy profile
$\sigma$ is played, let $A(\sigma) = \bigcup_{h \in H(\sigma)} A(h)$ be the actions at those information
sets and let $X(\sigma) = \bigcup_{h \in H(\sigma)} \{x \,|\, x \in h\}$ be the corresponding nodes. Let
$\pi_{-i}(\cdot|\sigma_{-i})$ be the behavior strategy for i's opponents that corresponds to
$\sigma_{-i}$. Let us say that the beliefs $\mu_i$ (over $\pi_{-i}$) are confirmed for $s_i$
and $\sigma_{-i}$ if

$\max_{a \in A(s_i,\sigma_{-i})} \int \|\pi_{-i}(a) - \pi_{-i}(a|\sigma_{-i})\|\,\mu_i(d\pi_{-i}) = 0,$

that is, $\mu_i$ puts probability one on the same play of opponents as does
$\sigma_{-i}$ at those information sets that are reached when $(s_i,\sigma_{-i})$ is played.
This captures the idea that untested beliefs about opponents' play off of
the equilibrium path can be incorrect.

Theorem 6.1: For fixed priors $g_i$ and $\delta < 1$ as $T \to \infty$ every sequence
$\bar\theta^m$ of steady states has an accumulation point $\bar\theta$; if $\bar\theta_i(s_i) > 0$ there
exist beliefs $\mu_i$ that are confirmed for $s_i$ and $\bar\theta_{-i}$ and such that $s_i$
maximizes $u_i(\cdot|\mu_i)$.

Remark: This notion of equilibrium is equivalent to the self-confirming
equilibrium defined and characterized in Fudenberg and Levine [1991].

The  idea  of  the  proof  should  already  be  clear  from  our  discussion  of 
the  proof  of  Theorem  5.1:   Long-lived  players  will  eventually  stop 
experimenting  and  play  to  maximize  their  current  expected  payoff  given  their 
beliefs,  and  their  beliefs  about  the  payoff  from  any  strategy  they  play  many 
times  are  approximately  correct.   The  formal  proof  is  in  Appendix  C. 


ENDNOTES 


This  concept  is  closely  related  to  the  "conjectural  equilibria"  of 
Battigalli and Guaitoli [1988], the "rationalizable conjectural equilibria"
of  Rubinstein  and  Wolinsky  [1990],  and  the  "private  beliefs  equilibrium"  of 
Kalai  and  Lehrer  [199?r. 

2 
See, however, Canning [1989].

3 

To avoid a trivial special case in one of our proofs, we will suppose
#Z > 1.

We denote the actions so that $A(h) \cap A(h') = \emptyset$ for $h \ne h'$.

This  is  known  as  Kuhn's  theorem  (Kuhn  [1953]).   Recent  presentations 
of  this  result  can  be  found  in  Fudenberg  and  Tirole  [1991]  and  Kreps  [1990], 
among  other  places. 

Boylan  [1990]  has  shown  that  this  deterministic  system  is  the  limit  of 
a  stochastic  finite-population  random-matching  model  as  the  number  of 
players  goes  to  infinity. 

Several readers have asked us the following questions: Won't players
update their beliefs in the course of period-t play as they observe the
actions of their opponents? And shouldn't they therefore deviate from the
original play of $r_i(y_i)$? The answers are that yes, players will update
their beliefs in the course of period-t play, but that the optimal plan at
the beginning of the period, $r_i(y_i)$, already takes this revision into
account. Intuitively, player i can foresee what his posterior beliefs
will be at every information set, and his optimal plan will thus maximize
his expected utility at each information set, conditional on that
information set being reached. (Remember that $r_i(y_i)$ is a strategy for the
extensive-form game, that is, it specifies a feasible action at every
information set. It does not specify the "same" action at every information
set; indeed by definition an action that is feasible at one information set
cannot be feasible at another.)

8

As an aside, we note that this example also shows why our learning
model does not yield results in the spirit of forward induction (Kohlberg
and Mertens [1986]). Forward induction interprets all deviations from the
path of play as attempts to gain in the current round. Since $L_1$ strictly
dominates $R_1$, forward induction argues that player 2 will believe that
player 1 has played $M_1$ whenever player 2's information set is reached, and
hence that player 2 will play $L_2$; this will lead player 1 to play $M_1$. In
contrast, in our model player 1 deviated from $L_1$ to gain information that
will help him in future rounds, and the cheapest way to do this is to play
$R_1$. When $R_1$ is more likely than $M_1$, $R_2$ is optimal for player 2.

9
Fudenberg and Kreps [1991], ch. 9, develop a learning model in which
players do not know the extensive form of the game. We believe that Theorem
5.1 would extend to this context, but Theorem 6.1 (on self-confirming
equilibrium) would not.

The more obvious probability space would have elements corresponding
to what player i would see if he plays $s_i$ in period t for each date t
of his life. Then, for each sample path, the realized terminal node the
player sees the first time he plays $s_i$ depends on the period in which the
strategy is first used. Our alternative generates the same probability
distribution over observations.


APPENDIX  A 


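Lemma A.1: If $x_i, y_i \in \mathbb{R}_+$,

$\big|\prod_{i=1}^n x_i - \prod_{i=1}^n y_i\big| \le \sum_{i=1}^n \big(\prod_{j=1}^{i-1} y_j\big)\,|x_i - y_i|\,\big(\prod_{j=i+1}^n x_j\big),$

so if $x_i \le 1$,

$\big|\prod_{i=1}^n x_i - \prod_{i=1}^n y_i\big| \le \sum_{i=1}^n \big(\prod_{j=1}^{i-1} y_j\big)\,|x_i - y_i|.$

Proof:

$\big|\prod_{i=1}^n x_i - \prod_{i=1}^n y_i\big| = \big|y_1\prod_{i=2}^n x_i - \prod_{i=1}^n y_i + \prod_{i=1}^n x_i - y_1\prod_{i=2}^n x_i\big|$

$\le y_1\big|\prod_{i=2}^n x_i - \prod_{i=2}^n y_i\big| + \prod_{i=2}^n x_i\,|x_1 - y_1|$

$\le y_1 y_2\big|\prod_{i=3}^n x_i - \prod_{i=3}^n y_i\big| + y_1\prod_{i=3}^n x_i\,|x_2 - y_2| + \prod_{i=2}^n x_i\,|x_1 - y_1|$

$\le \sum_{i=1}^n \big(\prod_{j=1}^{i-1} y_j\big)\,|x_i - y_i|\,\big(\prod_{j=i+1}^n x_j\big).$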
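A quick numerical check of the telescoping bound in Lemma A.1 can be run on random nonnegative vectors. The sketch below is only a sanity check, not a substitute for the proof; the function names are illustrative, and a small tolerance is added for floating-point rounding.

```python
import math
import random


def product_gap(x, y):
    """Left-hand side: |prod(x) - prod(y)|."""
    return abs(math.prod(x) - math.prod(y))


def telescoping_bound(x, y):
    """Right-hand side: sum over i of prod(y[:i]) * |x_i - y_i| * prod(x[i+1:])."""
    return sum(
        math.prod(y[:i]) * abs(x[i] - y[i]) * math.prod(x[i + 1:])
        for i in range(len(x))
    )


rng = random.Random(1)
trials = [
    ([rng.random() for _ in range(5)], [rng.random() for _ in range(5)])
    for _ in range(1000)
]
lemma_holds = all(
    product_gap(x, y) <= telescoping_bound(x, y) + 1e-12 for x, y in trials
)
print(lemma_holds)
```

When every $x_i - y_i$ has the same sign the two sides coincide, so the bound cannot be improved in general.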


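Lemma A.2: Suppose that $\Lambda(\pi|y)$ is a likelihood function for $\pi$ given a
sample y, that $g^0(\pi)$ and $h^0(\pi)$ are two prior densities both of which
are bounded above by $\bar g$ and bounded below by $\underline g > 0$. If $g(\pi|y)$ and
$h(\pi|y)$ are the corresponding posterior densities, then

$(\underline g/\bar g)^2 \le g(\pi|y)/h(\pi|y) \le (\bar g/\underline g)^2.$

Proof: Proceeds via the calculations

$\dfrac{g(\pi|y)}{h(\pi|y)} = \dfrac{\Lambda(\pi|y)\,g^0(\pi)}{\Lambda(\pi|y)\,h^0(\pi)} \cdot \dfrac{\int \Lambda(\pi'|y)\,h^0(\pi')\,d\pi'}{\int \Lambda(\pi'|y)\,g^0(\pi')\,d\pi'},$

where the first factor is at most $\bar g/\underline g$, and

$\dfrac{\int \Lambda(\pi|y)\,h^0(\pi)\,d\pi}{\int \Lambda(\pi|y)\,g^0(\pi)\,d\pi} = 1 + \dfrac{\int \Lambda(\pi|y)\,(h^0(\pi) - g^0(\pi))\,d\pi}{\int \Lambda(\pi|y)\,g^0(\pi)\,d\pi} \le 1 + \dfrac{[\bar g - \underline g]\int \Lambda(\pi|y)\,d\pi}{\underline g \int \Lambda(\pi|y)\,d\pi} = \bar g/\underline g.$

The lower bound follows by exchanging the roles of g and h.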


APPENDIX  B 

We begin the proof of the various lemmas by demonstrating the basic
fact that in large samples the posterior is uniformly close to the empirical
distribution: this is used in several places below.

Lemma B.1: For all strictly positive priors $g_i$ there is a nonincreasing
function $\eta(n) \to 0$ as $n \to \infty$ such that for all samples $y_i$, information
sets h, actions $a \in A(h)$, strategies $s_i$, and terminal nodes
$z \in Z(s_i)$,

(B.1.1)   $\int \|p_i(z|\pi_{-i}) - \hat p_i(z|y_i)\|\, g_i(\pi_{-i}|y_i)\,d\pi_{-i} \le \max_{x \in X(s_i)} \hat p_i(x|y_i)\,\eta(n(x|y_i)),$ and

(B.1.2)   $\int \|\pi_{-i}(a) - \hat\pi_{-i}(a|y_i)\|\, g_i(\pi_{-i}|y_i)\,d\pi_{-i} \le \eta(n(h|y_i)).$

Remark: Diaconis and Freedman [1989] show that for all samples, Bayes
estimates of multinomial probabilities converge to the sample average at a
rate that is independent of the particular sample so long as the prior
assigns strictly positive density to all distributions. One complication in
our model is that even if strategy $s_i$ has been played many times, there
may be information sets h that are reachable under $s_i$ but have not been
reached in the sample. A second complication is caused by the fact that the
distribution on terminal nodes generated by the sample average strategies $\hat\pi_{-i}$
does not equal the sample average on the terminal nodes. This explains the
complicated form of the right-hand side of (B.1.1).

Proof: Fix a strategy $s_i$ and terminal node $z \in Z(s_i)$. Let
$(a_{-i}(\ell,z))_{\ell=1}^L$ be the actions by other players (including nature) that lead
to z, and let $x(1)$ through $x(L)$ be the nodes where those actions are
taken. It follows from (2.1) and Lemma A.1 in Appendix A that

(B.1.3)   $\|p_i(z|\pi_{-i}) - \hat p_i(z|y_i)\| \le \sum_{\ell=1}^L \hat p_i(x(\ell)|y_i)\,\|\pi_{-i}(a_{-i}(\ell,z)) - \hat\pi_{-i}(a_{-i}(\ell,z)|y_i)\|.$

For each $h \in H_{-i}$ and $\epsilon < 1/\#A(h)$ let $B_\epsilon^h$ be the sphere in
$\Delta(A(h))$ of radius $\epsilon$ in the sup norm centered at $\hat\pi_{-i}(h)$, and let $\Pi^{-h}$
be the set of behavior strategies for information sets other than h. Since
$|\pi_{-i}(a) - \hat\pi_{-i}(a|y_i)| < \epsilon$ on the set $B_\epsilon^h \times \Pi^{-h}$ for all $a \in A(h)$,

(B.1.4)   $\int \|\pi_{-i}(a) - \hat\pi_{-i}(a|y_i)\|\, g_i(\pi_{-i}|y_i)\,d\pi_{-i} \le \epsilon + \int_{\sim B_\epsilon^h \times \Pi^{-h}} g_i(\pi_{-i}|y_i)\,d\pi_{-i}.$

Suppose first that $\tilde g_i$ is a non-doctrinaire prior for which $\pi_{-i}(h)$ is
independent of $\pi_{-i}(h')$, $h \ne h'$. Then the corresponding posterior $\tilde g_i$
consists of a product $\tilde g_i = \prod_h \tilde g_i^h$ where $\tilde g_i^h$ is a multinomial. Diaconis
and Freedman [1989] show that for a multinomial, if $\epsilon < 1/\#A(h)$ the
posterior odds ratio

$\int_{\sim B_\epsilon^h} \tilde g_i^h(\pi_{-i}(h)|y_i)\,d\pi_{-i}(h) \Big/ \int_{B_\epsilon^h} \tilde g_i^h(\pi_{-i}(h)|y_i)\,d\pi_{-i}(h)$

goes to zero as the sample size $n_i(h|y_i) \to \infty$ at an exponential rate that
is independent of the particular sample. Since the multinomials are
independent, the posterior odds ratio studied by Diaconis and Freedman
equals

$\dfrac{\int_{\sim B_\epsilon^h \times \Pi^{-h}} \tilde g_i(\pi_{-i}|y_i)\,d\pi_{-i}}{\int_{B_\epsilon^h \times \Pi^{-h}} \tilde g_i(\pi_{-i}|y_i)\,d\pi_{-i}} \equiv L_i(B_\epsilon^h|y_i).$

Now consider an arbitrary strictly positive prior $g_i$. Since both $g_i$
and $\tilde g_i$ are bounded above and below, we can conclude from Lemma A.2 that
there is a constant $k > 1$ such that for all $y_i$ and $\pi_{-i}$,

$\tilde g_i(\pi_{-i}|y_i)/k \le g_i(\pi_{-i}|y_i) \le k\,\tilde g_i(\pi_{-i}|y_i).$

Thus,

$\dfrac{\int_{\sim B_\epsilon^h \times \Pi^{-h}} g_i(\pi_{-i}|y_i)\,d\pi_{-i}}{\int_{B_\epsilon^h \times \Pi^{-h}} g_i(\pi_{-i}|y_i)\,d\pi_{-i}} \le k^2 L_i(B_\epsilon^h|y_i),$

and hence goes to zero at the same exponential rate. In particular, for all
$\epsilon > 0$ there is an $\eta^\epsilon(n) \to 0$ such that (B.1.4) is less than
$\epsilon + \eta^\epsilon(n(h|y_i))/L$. Choose $N_t$ so that $\eta^{1/t}(n) \le 1/t$ for $n \ge N_t$. Then
$\eta(n) = \eta^{1/t}(n)$ for $N_1 + \dots + N_{t-1} \le n < N_1 + N_2 + \dots + N_t$ satisfies
$\eta(n) \to 0$ and (B.1.4) less than $\eta(n(h|y_i))/L$. We conclude from (B.1.4)
that (B.1.2) is valid.

Now we combine (B.1.2) with (B.1.3) to obtain (B.1.1): Let $h(\ell)$ be
the information set containing $x(\ell)$, and note that
$n(h(\ell)|y_i) \ge n(x(\ell)|y_i)$. Then

(B.1.5)   $\int \|p_i(z|\pi_{-i}) - \hat p_i(z|y_i)\|\, g_i(\pi_{-i}|y_i)\,d\pi_{-i} \le \sum_{\ell=1}^L \hat p_i(x(\ell)|y_i)\,\eta(n(h(\ell)|y_i))/L \le \max_{x \in X(s_i)} \hat p_i(x|y_i)\,\eta(n(x|y_i)).$


Lemma 5.2: There exists a function $\eta(n) \to 0$ such that for all $y_i$ and $\delta$

(5.2.1)   $\max_{s_i} u_i(s_i|y_i) - u_i(s_i^k(y_i)|y_i) \le V_i^k(y_i) - u_i(s_i^k(y_i)|y_i) \le [\delta/(1-\delta)] \max_{x \in X(s_i^k(y_i))} \hat p_i(x|y_i)\,\eta(n(x|y_i)).$

Proof: The first inequality follows from the fact that $u_i(s_i|y_i) \le V_i^k(y_i)$
since playing $s_i$ for the rest of the game is feasible and yields
$u_i(s_i|y_i)$ in expected value each period.

To demonstrate the second inequality, observe that because the $V_i$'s
are average payoffs $V_i^{k-1}(y_i) \le V_i^k(y_i)$. It follows from the Bellman
equation (4.3) that

$(1-\phi_k)V_i^k(y_i) \le (1-\phi_k)\,u_i(s_i^k(y_i)|y_i) + \phi_k \sum_{z \in Z(s_i)} p_i(z|y_i)\,[V_i^{k-1}(y_i,z) - V_i^{k-1}(y_i)],$

or since $\phi_k \le \delta$,

$V_i^k(y_i) \le u_i(s_i^k(y_i)|y_i) + \delta/(1-\delta) \sum_{z \in Z(s_i)} p_i(z|y_i)\,[V_i^{k-1}(y_i,z) - V_i^{k-1}(y_i)].$


The final sum represents the expected value of new information. Note
that $V_i^k(y_i) = V_i^k(g_i(\cdot \mid y_i))$; that is, the value function depends only on
the current beliefs about $\pi_{-i}$. Consequently, the value of new information
should be small if the expected change in beliefs is small. To make this
idea precise, we introduce the $L^1$ norm on densities

$\|g_i - g_i'\|_1 = \int \|g_i(\pi_{-i}) - g_i'(\pi_{-i})\|\,d\pi_{-i}.$

It is standard that $V_i^k(g_i)$ is Lipschitz in $g_i$ with Lipschitz constant $U$
equal to the largest difference in any two payoffs. Consequently



$\sum_{z \in Z(s_i)} p_i(z \mid y_i)\,[V_i^{k-1}(y_i,z) - V_i^{k-1}(y_i)] \le U \sum_{z \in Z(s_i)} p_i(z \mid y_i)\,\|g_i(\cdot \mid (y_i,z)) - g_i(\cdot \mid y_i)\|_1.$

We may then calculate

$p_i(z \mid y_i)\,\|g_i(\cdot \mid (y_i,z)) - g_i(\cdot \mid y_i)\|_1$

  $= p_i(z \mid y_i) \int \|p_i(z \mid \pi_{-i})\,g_i(\pi_{-i} \mid y_i)/p_i(z \mid y_i) - g_i(\pi_{-i} \mid y_i)\|\,d\pi_{-i}$

  $\le 2 \int \|p_i(z \mid \pi_{-i}) - p_i(z \mid y_i)\|\,g_i(\pi_{-i} \mid y_i)\,d\pi_{-i},$

where the last inequality follows from

$p_i(z \mid y_i) = \int p_i(z \mid \pi_{-i})\,g_i(\pi_{-i} \mid y_i)\,d\pi_{-i}.$

We can now take $\eta$ to be the function whose existence is proved in Lemma
B.1, multiplied by the constant $2U(\#Z)$. ■
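
The belief-change calculation in the proof of Lemma 5.2 can be checked numerically. The following is a minimal sketch, assuming a belief concentrated on three hypothetical opponent strategies (the paper works with densities, and all numbers here are illustrative). It verifies that the probability-weighted $L^1$ change in beliefs after observing $z$ is bounded by twice the expected gap between $p_i(z \mid \pi_{-i})$ and $p_i(z \mid y_i)$; in this discrete example the bound holds with room to spare, since the weighted change equals the expected gap itself.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Bayes update of a discrete belief after observing outcome z.

    prior[j] plays the role of g(pi_j | y); likelihood[j] plays the role of p(z | pi_j).
    Returns the updated belief and the marginal probability p(z | y).
    """
    marginal = sum(g * p for g, p in zip(prior, likelihood))
    updated = [g * p / marginal for g, p in zip(prior, likelihood)]
    return updated, marginal

def l1_distance(g1, g2):
    """L1 distance between two discrete beliefs."""
    return sum(abs(a - b) for a, b in zip(g1, g2))

# Hypothetical three-point belief and likelihoods (illustrative values only).
prior = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
likelihood = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]

post, p_z = posterior(prior, likelihood)

# Probability-weighted change in beliefs: p(z|y) * ||g(.|y,z) - g(.|y)||_1
weighted_change = p_z * l1_distance(post, prior)

# Expected gap between the conditional and marginal outcome probabilities
expected_gap = sum(abs(p - p_z) * g for p, g in zip(likelihood, prior))

print(weighted_change, expected_gap)
```

Exact rational arithmetic is used so that the comparison is not blurred by floating-point rounding.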

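Lemma 5.3: For all $\epsilon > 0$ and $\Delta > 0$, there are $\underline{\delta} < 1$ and $K$ such that
for all $\delta \in [\underline{\delta}, 1)$ and for all $k \ge K$

(5.3.1)  $\Delta \cdot P(s_i^k(y_i), \Delta, y_i) - \epsilon \le \dfrac{V_i^k(y_i) - u_i(s_i^k(y_i) \mid y_i)}{1-\epsilon}.$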

Proof: From classical statistics, for all $\epsilon$ and $\Delta$ there exists a $t$
and a $t$-period test procedure for the hypothesis

$\pi_{-i} \in \{\pi_{-i} \mid u_i(s_i, \pi_{-i}) \ge u_i(s_i^k(y_i), \pi_{-i}) + \Delta\}$

such that the type I and type II errors are less than $\epsilon/2U$ for all
steady states $\bar\theta_{-i}$. Consider the policy of first running this test, and
then playing $s_i$ for the remaining $k-t$ periods if the hypothesis is accepted
and playing $s_i^k(y_i)$ otherwise. For notational convenience, let
$\phi(t,k) = (\delta^t - \delta^k)/(1 - \delta^k)$ be the weight placed
on the last $k-t$ periods. Then the hypothesis-testing policy yields a
utility of at least

(5.3.2)  $-(1-\phi(t,k))\,U + \phi(t,k)\,[u_i(s_i^k(y_i), y_i) + (1-\epsilon/2U)\,\Delta \cdot P(s_i^k(y_i), \Delta, y_i) - \epsilon/2] \le V_i^k(y_i),$

since the hypothesis-testing value cannot exceed the value of the optimal
policy.

Rearranging terms and using $\phi(t,k) \le 1$, $1-\epsilon \le 1$, we find that

(5.3.3)  $\Delta P(s_i^k(y_i), \Delta, y_i) \le \dfrac{1}{\phi(t,k)(1-\epsilon/2U)} \cdot [(1-\phi(t,k))\,U - \phi(t,k)\,u_i(s_i^k(y_i), y_i) + \epsilon/2 + \phi(t,k)\,V_i^k(y_i)]$


< 


£ 


,    .  a-Mt.Vn       „  .     1:^  |vk<„  ,  .  u,<s^y,,|v.) 


^   *(  =  .k)<l-./2U)     TTTTTOT  "i"!'   -l-i"l".'f 

We can choose $\underline{\delta}$ and $k(\delta)$ such that for each $\delta \in (\underline{\delta}, 1)$ the conclusion
of the lemma follows. ■
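
The weight $(\delta^t - \delta^k)/(1-\delta^k)$ used in the proof of Lemma 5.3 can be illustrated with a short computation. This is a sketch under the assumption that payoffs over a $k$-period horizon are averaged with normalized geometric weights $(1-\delta)\delta^j/(1-\delta^k)$ for $j = 0, \dots, k-1$; the function names are hypothetical. It checks that the closed form matches the summed weights on the last $k-t$ periods, and that the weight approaches 1 when the test length $t$ is fixed while players become patient and the horizon grows.

```python
def phi(t, k, delta):
    """Closed-form weight on the last k - t periods of a k-period horizon."""
    return (delta**t - delta**k) / (1 - delta**k)

def tail_weight(t, k, delta):
    """Same weight obtained by summing normalized geometric weights directly.

    Assumes period j (j = 0, ..., k-1) carries weight (1 - delta) * delta**j / (1 - delta**k),
    so that all k weights sum to one.
    """
    weights = [(1 - delta) * delta**j / (1 - delta**k) for j in range(k)]
    return sum(weights[t:])

# A fixed test length t costs little once delta is near 1 and k is large.
for delta, k in [(0.9, 50), (0.99, 500), (0.9999, 10_000)]:
    print(delta, k, round(phi(5, k, delta), 4))
```

The printed values rise toward 1, which is why the loss from spending $t$ periods on the test can be made smaller than any given $\epsilon$ by choosing $\delta$ close to 1 and $k$ large.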

We turn now to sampling theory: what fraction of the population can
have such a badly biased sample that maximum likelihood estimation yields a
poor estimate of the true steady state values? Stating the desired result
requires some additional notation. Fix a horizon $T_m$, a plan $r_i^m$ for the
$i$-th kind of player, and the population fraction of opponents' actions $\bar\theta_{-i}^m$.
From (3.7) we can calculate the corresponding steady state fractions $\theta_i^m$
for population $i$. Let $Y_i$ denote a subset of the histories for player $i$.
We define

$\theta_i^m(Y_i) = \sum_{y_i \in Y_i} \theta_i^m(y_i)$

to be the steady state fraction of population $i$ in category $Y_i$.


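Lemma B.2: For all $\epsilon > 0$ and $\eta(n) \to 0$ as $n \to \infty$ there is an $N$ such that
for all $\bar\theta_{-i}^m$, $r_i^m$, $s_i$ and $T_m$: if $h \in H_{-i}$ and $a \in A(h)$, then

(B.2.1)  $\theta_i^m\big(|\hat\pi_{-i}(a \mid y_i) - \pi_{-i}(a \mid \bar\theta_{-i}^m)| > \epsilon, \text{ and } n(h \mid y_i) > N\big) < \epsilon.$

If $x \in X(s_i)$,

(B.2.2)  $\theta_i^m\big(n(x \mid y_i) < [p_i(x \mid \bar\theta_{-i}^m) - \epsilon]\,n(s_i \mid y_i), \text{ and } n(s_i \mid y_i) > N\big) < \epsilon,$

(B.2.3)  $\theta_i^m\big(\max_x \|p_i(x \mid y_i) - p_i(x \mid \bar\theta_{-i}^m)\| > \epsilon, \text{ and } n(s_i \mid y_i) > N\big) < \epsilon,$ and

(B.2.4)  $\theta_i^m\big(\max_x p_i(x \mid \bar\theta_{-i}^m)\,\eta(n(x \mid y_i)) > \epsilon, \text{ and } n(s_i \mid y_i) > N\big) < \epsilon.$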

The first statement says that the population fraction having both a badly
biased sample on $a \in A(h)$ and a large number of observations on $h$ is
uniformly small regardless of $\bar\theta_{-i}$. The second says that the population
fraction that has both played a strategy making $x$ likely many times, and
has seen $x$ only rarely, is uniformly small regardless of $\bar\theta_{-i}$. Notice what
is asserted: the fraction with both a large sample and a biased sample is
small. It need not be true that of those who have a large sample most have
an unbiased sample.

Proof of Lemma B.2: Since each period's actions are i.i.d., we can model
the distribution on opponents' actions faced by player $i$ by a probability
space whose elements are what he will see the $k$-th time he arrives at
information set $h$. The Glivenko-Cantelli theorem shows that the empirical
distribution at each information set $h$ converges to the distribution
induced by $\bar\theta_{-i}^m$ as the number of observations $n(h \mid y_i) \to \infty$, at a rate that
holds uniformly over all values of $\bar\theta_{-i}^m$ and the finite number of
information sets. It remains to explain why the rate is uniform over all "sampling
rules" $r_i(y_i)$. This follows from the fact that the desired inequalities
hold even if player $i$ is informed of the entire sample path of each
information set before choosing his rule: through such anticipatory
sampling he may be able to ensure that most of the long samples are biased,
but since there are few paths where the empirical distribution is not close
to the theoretical one after $N$ samples, there is no sampling rule for
which the probability of long and biased samples is large.
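
The uniformity claim can be illustrated for the simplest case of a single binary action and a fixed sample size. The sketch below is an illustration only: it does not model the adaptive sampling rules $r_i(y_i)$ handled in the proof, and the function names and grid are hypothetical. It computes the exact probability that an empirical frequency misses the true probability by more than $\epsilon$, takes the worst case over a grid of true probabilities, and compares it with Hoeffding's bound $2\exp(-2n\epsilon^2)$, which does not depend on the true probability.

```python
from math import comb, exp

def biased_sample_prob(n, p, eps):
    """Exact P(|k/n - p| > eps) when k ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k / n - p) > eps
    )

def worst_case(n, eps, grid=101):
    """Largest bias probability over an evenly spaced grid of true probabilities."""
    return max(biased_sample_prob(n, j / (grid - 1), eps) for j in range(grid))

eps = 0.1
for n in (50, 100, 200):
    # Worst-case exact probability next to the distribution-free Hoeffding bound.
    print(n, worst_case(n, eps), 2 * exp(-2 * n * eps**2))
```

Because the bound is free of the true probability, one sample size $N$ works for every population fraction, which is the sense in which the rate is uniform.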

A similar argument yields (B.2.2), except that now we use a probability
space in which the elements are what player $i$ will see the $k$-th time he
plays strategy $s_i$.

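Next we turn to (B.2.3). By Lemma A.1, if $(a_{-i}(\ell,x))$ are the moves
by other players or nature and $x(1), \dots, x(L)$ are the nodes leading to $x$,

(B.2.5)  $\|p_i(x \mid y_i) - p_i(x \mid \bar\theta_{-i}^m)\|$

  $\le \sum_{\ell=1}^{L} p_i(x(\ell) \mid \bar\theta_{-i}^m)\,\|\hat\pi_{-i}(a_{-i}(\ell,x) \mid y_i) - \pi_{-i}(a_{-i}(\ell,x) \mid \bar\theta_{-i}^m)\|$

  $\le \sum_{\ell=1}^{L} \min\big\{p_i(x(\ell) \mid \bar\theta_{-i}^m),\ \|\hat\pi_{-i}(a_{-i}(\ell,x) \mid y_i) - \pi_{-i}(a_{-i}(\ell,x) \mid \bar\theta_{-i}^m)\|\big\},$

where the last step follows from $p_i, \hat\pi_{-i}, \pi_{-i} \le 1$. From (B.2.5) it follows
that if

$\max_x \|p_i(x \mid y_i) - p_i(x \mid \bar\theta_{-i}^m)\| > \epsilon$

occurs, then

$\min\big\{p_i(x \mid \bar\theta_{-i}^m),\ \|\hat\pi_{-i}(a \mid y_i) - \pi_{-i}(a \mid \bar\theta_{-i}^m)\|\big\} > \epsilon/L$

for some $x \in X(s_i)$ and $a \in A(h)$ with $x \in h$. Consequently, if we can
show how to find $N$ for each such $x$ and $a$ so that

(B.2.6)  $\theta_i^m\big(p_i(x \mid \bar\theta_{-i}^m) > \epsilon,\ \|\hat\pi_{-i}(a \mid y_i) - \pi_{-i}(a \mid \bar\theta_{-i}^m)\| > 2\epsilon/3L,\ n(s_i \mid y_i) > N\big) < \epsilon/L,$

then (B.2.3) will follow.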


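Choose $N'$ so that (B.2.1) holds for $2\epsilon/3L$, and choose $N$ so that
(B.2.2) holds for $\epsilon/3L$ and so that $N > N'/\epsilon$. If $x \in h$,
$n(h \mid y_i) \ge n(x \mid y_i)$. Consequently, if $p_i(x(\ell) \mid \bar\theta_{-i}^m) > \epsilon$, by (B.2.2),

(B.2.7)  $\theta_i^m\big(n(h \mid y_i) < N', \text{ and } n(s_i \mid y_i) > N\big) < \epsilon/3L.$

On the other hand, by (B.2.1),

(B.2.8)  $\theta_i^m\big(|\hat\pi_{-i}(a \mid y_i) - \pi_{-i}(a \mid \bar\theta_{-i}^m)| > 2\epsilon/3L \text{ and } n(h \mid y_i) > N'\big) < 2\epsilon/3L.$

We conclude

(B.2.9)  $\theta_i^m\big(|\hat\pi_{-i}(a \mid y_i) - \pi_{-i}(a \mid \bar\theta_{-i}^m)| > 2\epsilon/3L,\ n(s_i \mid y_i) > N\big) < \epsilon/L$

whenever $p_i(x(\ell) \mid \bar\theta_{-i}^m) > \epsilon$, which proves (B.2.6), and consequently (B.2.3).
To show (B.2.4) we proceed as in (B.2.5): If $p_i(x \mid \bar\theta_{-i}^m)\,\bar\eta < \epsilon/2$, where
$\bar\eta = \sup_n \eta(n)$, we are done. Let $N_1$ be large enough that $\eta(n) < \epsilon/2$ for
$n > N_1$, then choose $N$ so that (B.2.2) holds for $\epsilon' < \epsilon/2$ such that
$(\epsilon/2 - \epsilon')N > N_1$. ■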

Our next step is to argue that the players are unlikely to have a large
but inaccurate sample, so that they are unlikely to be confident of an
incorrect forecast. We wish this to be true uniformly over the population
fractions $\bar\theta_{-i}$, which will follow from the uniform version of the strong law
of large numbers, that is, the Glivenko-Cantelli theorem.

Recall that $\theta_i(Y_i)$ is the steady state fraction of population $i$
whose histories $y_i$ lie in $Y_i$. Because our aggregate system is
deterministic, $\theta_i(Y_i)$ is equal to the expected frequency with which the "old" (age
$T_m$) players encounter events in $Y_i$. In particular, for a set $Y_i$ that
consists of all subhistories of a set of terminal histories (i.e., histories
of length $T_m$) and a particular terminal history $y_i$, define $J(y_i, Y_i)$ to
be the number of times that a subhistory of $y_i$ lies in $Y_i$. Then we have


(B.4)  $\theta_i^m(Y_i) = \sum_{y_i \,:\, \text{length } T_m} J(y_i, Y_i)\,\theta_i^m(y_i),$

so that to bound the population fractions, it suffices to bound the
probabilities of the corresponding length-$T_m$ histories.
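
The link between population fractions and the experience of the oldest players can be checked by brute force in a toy steady state. The sketch below rests on assumptions that are not in the text: each age cohort $t = 1, \dots, T$ has mass $1/T$, a player of age $t$ holds a history of $t$ i.i.d. binary observations, and the category is "at least two observations equal to 1". All names and numbers are hypothetical. It compares the population fraction computed cohort by cohort with the expected number of qualifying subhistories along a terminal history, divided by $T$.

```python
from fractions import Fraction
from itertools import product

T = 4                 # hypothetical lifetime
q = Fraction(1, 3)    # hypothetical per-period probability of observing a 1

def prob(history):
    """Probability of a sequence of i.i.d. binary observations."""
    out = Fraction(1)
    for z in history:
        out *= q if z == 1 else 1 - q
    return out

def in_category(history):
    """Illustrative category: the history contains at least two 1s."""
    return sum(history) >= 2

# Cohort-by-cohort population fraction: each age cohort has mass 1/T.
direct = sum(
    Fraction(1, T) * prob(h)
    for t in range(1, T + 1)
    for h in product((0, 1), repeat=t)
    if in_category(h)
)

def J(terminal):
    """Number of subhistories (prefixes) of a terminal history lying in the category."""
    return sum(in_category(terminal[:t]) for t in range(1, T + 1))

# Same fraction via terminal histories only: expected J divided by T.
via_terminal = sum(prob(y) * Fraction(J(y), T) for y in product((0, 1), repeat=T))

print(direct, via_terminal)
```

The two numbers agree exactly, so a bound on the probabilities of length-$T$ histories translates directly into a bound on population fractions.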

Our goal is to relate $\theta_i$, representing the population fractions with
each experience, to the fractions $\bar\theta_{-i}$ that determine the probability
distributions over observations.


Lemma 5.4: For all $\epsilon > 0$ and $\eta(n) \to 0$ as $n \to \infty$ there is an $N$ such
that for all $T_m$, $\bar\theta_{-i}^m$, $r_i^m$, and $s_i$

(5.4.1)  $\theta_i^m\big(\max_x p_i(x \mid y_i)\,\eta(n(x \mid y_i)) > \epsilon, \text{ and } n(s_i \mid y_i) > N\big) < \epsilon.$

Proof: Letting $\bar\eta = \sup_n \eta(n)$, by (B.2.3) we may choose $N$ large enough
that

$\theta_i^m\big(\max_{x \in X(s_i)} \bar\eta\,\|p_i(x \mid y_i) - p_i(x \mid \bar\theta_{-i}^m)\| > \epsilon/2 \text{ and } n(s_i \mid y_i) > N\big) < \epsilon/2,$

so that the conclusion follows from (B.2.4). ■


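Lemma 5.6: For all $\epsilon$ there exists a $\gamma$ such that for all $y_i$, $\bar\theta_{-i}$, $r_i$
and $T_m$

$\theta_i^m\left(\dfrac{\int_B g_i(\pi_{-i} \mid y_i)\,d\pi_{-i}}{\int_B g_i(\pi_{-i})\,d\pi_{-i}} < \gamma\right) < \epsilon.$

Proof: Fix $g_i$ and $\epsilon$ and let $B = B_\epsilon(\bar\theta_{-i})$. We must find $\gamma$ so that
the displayed inequality holds regardless of $y_i$, $\bar\theta_{-i}$, $r_i$ and $T_m$.
Since $g_i$ is bounded away from 0, and $\epsilon$ is fixed, it suffices to find a
$\gamma'$ so that

$\theta_i^m\left(\int_B g_i(\pi_{-i} \mid y_i)\,d\pi_{-i} < \gamma'\right) < \epsilon.$
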

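Define $B_z = \{\pi_{-i} \mid \|p_i(z \mid \bar\theta_{-i}) - p_i(z \mid \pi_{-i})\| < \epsilon\}$ and recall that $B = \cap_z B_z$.
By (B.1.1) of Lemma B.1, Lemma 5.4, and $\#Z \ge 1$, we may find an $N$ so that

$\theta_i^m\left(\int \|p_i(z \mid \pi_{-i}) - p_i(z \mid \bar\theta_{-i})\|\,g_i(\pi_{-i} \mid y_i)\,d\pi_{-i} > 2\epsilon/3,\ n(s_i \mid y_i) > N\right) < \epsilon/\#Z.$

Now $I = \int \|p_i(z \mid \bar\theta_{-i}) - p_i(z \mid \pi_{-i})\|\,g_i(\pi_{-i} \mid y_i)\,d\pi_{-i}$ may be written as the sum
of integrals over $B_z$ and $-B_z$, so in particular

$I \ge \epsilon \int_{-B_z} g_i(\pi_{-i} \mid y_i)\,d\pi_{-i}.$

Since $\int g_i(\pi_{-i} \mid y_i)\,d\pi_{-i} = 1$, $I \ge 2\epsilon/3$ also follows from
$\int_{B_z} g_i(\pi_{-i} \mid y_i)\,d\pi_{-i} \le 1/3$. We conclude

$\theta_i^m\left(\int_{B_z} g_i(\pi_{-i} \mid y_i)\,d\pi_{-i} \le 1/3,\ n(s_i \mid y_i) > N\right) < \epsilon/\#Z.$

Take $\gamma' = \min\big\{1/3,\ \min_{z,\ n(s_i \mid y_i) \le N} \int_{B_z} g_i(\pi_{-i} \mid y_i)\,d\pi_{-i}\big\}$.
Since

$\left\{\int_B g_i(\pi_{-i} \mid y_i)\,d\pi_{-i} > \gamma'\right\} \subset \bigcap_z \left\{\int_{B_z} g_i(\pi_{-i} \mid y_i)\,d\pi_{-i} > \gamma'\right\},$

it follows that

$\theta_i^m\left(\int_B g_i(\pi_{-i} \mid y_i)\,d\pi_{-i} < \gamma'\right) < \epsilon.$ ■
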

Lemma 5.7: Let $T_m \to \infty$ be a sequence of lifetimes, and $\theta^m$ be a
subsequence of steady states that converge to $\bar\theta$, and let $r_i^m$ be the
corresponding rules. If $\bar\theta_i(s_i) > 0$, then

(5.7.1)  $\theta_i^m\big(n(s_i \mid y_i) > N \text{ and } r_i^m(y_i) = s_i\big) \ge \bar\theta_i^m(s_i) - (N/T_m).$

Proof: Since $\bar\theta_i(s_i) > 0$ there exists an $\epsilon > 0$ and $\bar{m}$ such that
$\bar\theta_i^m(s_i) > 2\epsilon$ for all $m > \bar{m}$. Now fix $N$. For any history $y_i$, there are
at most $N$ subhistories $y_i'$ for which $r_i(y_i') = s_i$ and $n(s_i \mid y_i') < N$.
Since $\theta_i(y_i) \le 1/T_m$, equation (B.4) shows that

(5.7.2)  $\theta_i^m\big(n(s_i \mid y_i) < N \text{ and } r_i^m(y_i) = s_i\big) \le N/T_m.$

Since $\bar\theta_i^m(s_i)$ is the sum of the fractions playing $s_i$ with $n(s_i \mid y_i) < N$
and those playing $s_i$ with $n(s_i \mid y_i) \ge N$, (5.7.1) follows. ■

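The following lemma is used in Appendix C:

Lemma B.3: For all $\epsilon > 0$ there is an $N$ such that for all $T_m$, $\bar\theta_{-i}$, $r_i$
and $s_i$, if $h \in H_{-i}$ and $a \in A(h)$,

$\theta_i^m\left(\int \|\pi_{-i}(a) - \pi_{-i}(a \mid \bar\theta_{-i})\|\,g_i(\pi_{-i} \mid y_i)\,d\pi_{-i} > \epsilon \text{ and } n(h \mid y_i) > N\right) < \epsilon.$
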
Proof: This combines (B.2.1) from Lemma B.2 with (B.1.2) from Lemma B.1. ■

Theorem 6.1: For fixed strictly positive priors $g_i$ and $\delta < 1$, as $T_m \to \infty$
every sequence $\theta^m$ of steady states has an accumulation point $\bar\theta$; if
$\bar\theta_i(s_i) > 0$ there exist beliefs $\mu_i$ that are confirmed for $s_i$ and $\bar\theta_{-i}$
and such that $s_i$ maximizes $u_i(\cdot, \mu_i)$.


Proof of Theorem 6.1: Let $\theta^m$ be a subsequence of steady states that
converge to $\bar\theta$, and let $r_i^m$ be the corresponding optimal rules. Suppose
$\bar\theta_i(s_i) > 0$.

We will say that $s_i$ is a static $\epsilon$-best response to marginal beliefs
$p_i$ if, for all $s_i'$,

$u_i(s_i, p_i) + \epsilon \ge u_i(s_i', p_i).$

Fix $\eta(n)$ so that Lemma 5.4 holds. Fix $\epsilon$. By Lemma 5.2 and the fact that
$r_i^m$ is optimal,


(6.1.1)  $\theta_i^m\big(r_i^m(y_i) = s_i,\ s_i \text{ is not a static } \#Z(s_i)U\epsilon/(1-\delta) \text{ best response to } p_i(\cdot \mid y_i), \text{ and } n(s_i \mid y_i) > N\big) < \epsilon.$

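Let $X_{-i}(s_i)$ be the nodes hit (in the limit) with positive probability
when $s_i$ is played: $X_{-i}(s_i) = X(s_i, \bar\theta_{-i})$. Let
$\underline{p} = \min_{s_i,\ x \in X_{-i}(s_i)} p(x \mid \bar\theta_{-i})$. We may assume $m$ is large enough that
$p(x \mid \bar\theta_{-i}^m) > \underline{p}/2$ for $x \in X_{-i}(s_i)$ and
$\|\pi_{-i}(a \mid \bar\theta_{-i}^m) - \pi_{-i}(a \mid \bar\theta_{-i})\| < \epsilon$. Note
that for $x \in h$, $n(h \mid y_i) \ge n(x \mid y_i)$. Consequently, by (5.4.1) of Lemma 5.4,
for any $\epsilon$ we may find an $N$ so that

(6.1.2)  $\theta_i^m\big(\max_{h \in H_{-i} \cap H(s_i, \bar\theta_{-i})} \eta(n(h \mid y_i)) > 2\epsilon/\underline{p}, \text{ and } n(s_i \mid y_i) > N\big) < \epsilon.$
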


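We may take $N$ large enough to satisfy the conclusion of Lemma B.3:

(6.1.3)  $\theta_i^m\left(\max_{a \in A_{-i}(s_i)} \int \|\pi_{-i}(a) - \pi_{-i}(a \mid \bar\theta_{-i})\|\,g_i(\pi_{-i} \mid y_i)\,d\pi_{-i} > 3\epsilon, \text{ and } n(s_i \mid y_i) > N\right) < 2\epsilon.$

Since $\bar\theta_i(s_i) > 0$, choosing $m$ large enough that $N/T_m < \epsilon'/2$ and
$(\bar\theta_i(s_i) - \bar\theta_i^m(s_i)) < \epsilon'/2$, it follows from Lemma 5.7 that

(6.1.4)  $\theta_i^m\big(n(s_i \mid y_i) > N \text{ and } r_i^m(y_i) = s_i\big) > \epsilon'.$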

Combining (6.1.1) through (6.1.4) yields

(6.1.5)  $\theta_i^m\Big(r_i^m(y_i) = s_i,\ s_i \text{ is a static } 2\#Z(s_i)U\epsilon/(1-\delta) \text{ best response to } p_i(\cdot \mid y_i),$
  $\max_{a \in A_{-i}(s_i)} \int \|\pi_{-i}(a) - \pi_{-i}(a \mid \bar\theta_{-i})\|\,g_i(\pi_{-i} \mid y_i)\,d\pi_{-i} < 3\epsilon \text{ and } n(s_i \mid y_i) > N\Big) > \epsilon' - 3\epsilon.$

If we take $3\epsilon < \epsilon'$, we conclude that for some $y_i$, $s_i$ is a $2\#Z(s_i)U\epsilon/(1-\delta)$
best response to $p_i(\cdot \mid g_i(\cdot \mid y_i))$ with

$\max_{a \in A_{-i}(s_i)} \int \|\pi_{-i}(a) - \pi_{-i}(a \mid \bar\theta_{-i})\|\,g_i(\pi_{-i} \mid y_i)\,d\pi_{-i} < 3\epsilon.$

Taking $\epsilon \to 0$, we see that as a measure on $\pi_{-i}$, $g_i(\cdot \mid y_i)$ has a weak
limit point $\mu_i$. Then $s_i$ is a best response to $p_i(\cdot \mid \mu_i)$, that is,
maximizes $u_i(\cdot, \mu_i)$, and

$\max_{a \in A_{-i}(s_i)} \int \|\pi_{-i}(a) - \pi_{-i}(a \mid \bar\theta_{-i})\|\,\mu_i(d\pi_{-i}) = 0.$ ■
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